Towards future synergies: An overview of audiovisual-related projects at the Digital Humanities Conference (DH2019)

By: Liliana Melgar (Utrecht University / The Netherlands Institute for Sound and Vision)

The largest Digital Humanities conference thus far (DH2019) took place in Utrecht from July 8 to 12, 2019. More than one thousand (digital) humanists from all over the world participated in this annual gathering of scholars, curators, librarians, information and computer scientists, and publishers, among others, who are incorporating, experimenting with, and innovating in digital methods for doing or supporting scholarly work.

The use of digital sources and methods in the humanities (called “digital humanities”) has clear foundations and development in textual domains (such as literary studies or historical research). But, for some years already, the increasing availability of audio-visual materials has started to draw scholars’ attention to the potential of those sources within fields traditionally dominated by text, and to the possibilities of using computational methods in visually-based domains, such as (digital) art history.

In this two-part blog post, I summarize most of the AV-related workshops, panels, and presentations at DH2019, and share some impressions about the sessions that I attended or whose papers I read after the conference. In Part 1, I introduce the central workshop (AVinDH). Then, in Part 2, I group the talks into different themes. The entire DH2019 program can be accessed here, and the book of abstracts here (I provide the links to each abstract).

AVinDH SIG Workshop

(see this part of the blog post also at The Netherlands Institute for Sound and Vision blog)

The awareness of the increasing relevance of AV sources for scholarly work led to the idea of founding a special interest group (AVinDH SIG) during the DH Conference in Lausanne in 2014. The group has the aim “to facilitate communication and interaction between researchers from various disciplines including domains such as media studies, history, oral history studies, visual culture studies, social signal processing, archeology, anthropology, linguistics.”

On Monday 8 July the fifth AVinDH workshop organized by this interest group took place at DH2019. The workshop, chaired this time by Lauren Tilton (University of Richmond) and Jasmijn van Gorp (Utrecht University), had around 20 participants from domains such as film, television and media studies, cultural history, stylometry, spatial humanities, information and computer science, arts, design and linguistics.

Image 1. AVinDH SIG workshop at DH2019, from left to right: John Bell, Dimakatso Mathe, Christian Olesen, Melvin Wevers, Liliana Melgar, Susan Aasman, Mark Williams, Lauren Tilton, Carol Chiodo, David Wrisley, Julia Noordegraaf, Taylor Arnold, Joanna Byszuk, Jasmijn van Gorp, Manuel Burghardt, Nanne van Noord, and Daniel Chavez Heras.

The AVinDH workshop included parallel tutorials and short “lightning” talks. The slides and materials are linked from the workshop’s web page. These were the tutorials:

  • A tutorial by Julia Noordegraaf, Jasmijn van Gorp and myself about using the CLARIAH Media Suite, showing the potential of using the automatic speech transcripts (ASR) that are progressively being added to The Netherlands Institute for Sound and Vision audiovisual collection, and made available to researchers via the Media Suite. The tutorial showed, from a source and tool criticism perspective, how the Media Suite makes it possible to search AV content using the ASR transcripts. Participants were invited to reflect (using the tools provided for metadata inspection) on the consequences of doing research with automatic annotations which are constantly growing (but often “incomplete”), and cannot be 100% accurate.
  • A tutorial by Bernhard Rieder and Thomas Poell (University of Amsterdam) exemplified how to do research about online social media activity based on material produced by public broadcasters. They explained how they extracted and used YouTube data (with one of the tools offered by the Digital Methods Initiative at the University of Amsterdam) in combination with broadcast programs from The Netherlands Institute for Sound and Vision’s archive, which is made available to researchers via the CLARIAH Media Suite. Their project on the European refugee crisis (from 2013) consisted of finding YouTube clips from public broadcasters by matching the automatic speech transcripts, and of analyzing the related online social media activity.
  • Lauren Tilton and Taylor Arnold’s tutorial on using the “Distant Viewing” Toolkit for the analysis of images using deep learning. The tutorial offered the participants the opportunity to learn the basics of image processing in Python and the concepts of deep learning for images, and to apply the Distant Viewing Toolkit to moving images.
  • An introduction to the Media Ecology Project (MEP) and a practical hands-on tutorial with the Semantic Annotation Tool (SAT) was given by John Bell and Mark Williams. The participants learned how to easily embed the SAT annotation client, the “waldorf.js” plugin (“a drop-in module that facilitates the creation and sharing of time-based media annotations on the Web”), in any website that streams video content. These annotations can be stored and created collaboratively thanks to the SAT “statler” server.
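Searching AV content through speech transcripts, as discussed in the first two tutorials, can be illustrated with a minimal sketch. It assumes a transcript is a list of time-coded segments; the `start`/`end`/`text` field names, the sample sentences, and the `search_transcript` function are hypothetical and do not reflect the Media Suite's actual data model or API.

```python
import re


def search_transcript(segments, query):
    """Return (start, end, text) for every segment containing all query words.

    `segments` is a list of dicts with hypothetical keys "start", "end"
    (seconds) and "text" (the automatically recognised speech).
    """
    wanted = set(re.findall(r"\w+", query.lower()))
    if not wanted:
        return []
    hits = []
    for seg in segments:
        words = set(re.findall(r"\w+", seg["text"].lower()))
        if wanted <= words:  # every query word occurs in this segment
            hits.append((seg["start"], seg["end"], seg["text"]))
    return hits


# Invented example data, for illustration only.
segments = [
    {"start": 0.0, "end": 4.2, "text": "Good evening, this is the news."},
    {"start": 4.2, "end": 9.8, "text": "Refugees arrived at the border today."},
    {"start": 9.8, "end": 15.0, "text": "The weather will be sunny."},
]

for start, end, text in search_transcript(segments, "refugees border"):
    print(f"{start:>6.1f}s - {end:>6.1f}s  {text}")
```

Because the match is exact, a word that the speech recognizer misheard is simply not found, which is one concrete reason why the tutorial asked participants to reflect on incomplete and imperfect automatic annotations.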


The AVinDH workshop also included “lightning talks” in which the participants presented their ongoing AV-related research:

  • Manuel Burghardt, from the Computational Humanities Group at Leipzig University, introduced the “Scalable MovieBarcodes,” an exploratory interface for the analysis of movies.
  • Nanne van Noord, from the University of Amsterdam, described the “Sensory Moving Image Archive” (SEMIA) project, and how they used computer vision to analyse non-verbal syntactic features in moving image material.
  • Joanna Byszuk, a stylometrist from the Institute of Polish Language, introduced her work on Distant reading television, “a stylometry of textual layer of television shows and a few problems related to its representativeness, features, and authorship.”
  • Susan Aasman, media scholar from Groningen University, presented work-in-progress of the research project “Intimate histories; a web-archaeological approach to the early history of YouTube.”
  • Melvin Wevers, postdoctoral researcher in the Digital Humanities Lab at the KNAW Humanities Cluster, explained why “drawing boxes” is difficult, showing the challenges of using computer vision for the identification of people’s gender, to be used in the subsequent study of gender representation in newspaper advertisements.
  • Liliana Melgar, from Utrecht University and The Netherlands Institute for Sound and Vision (NISV), also on behalf of Mari Wigham, data scientist from NISV, both working for the CLARIAH project, argued that the CLARIAH Media Suite’s graphical user interface (GUI) should work in combination with Jupyter Notebooks, facilitating the analysis of audiovisual data in a flexible and transparent way.
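The general idea behind a movie barcode, as mentioned in the first lightning talk, is to reduce every frame to a single colour and place these colours side by side. The sketch below illustrates that idea only; it is not the Scalable MovieBarcodes implementation. Frames are assumed to be nested lists of (r, g, b) tuples, and the function names and tiny test frames are invented for illustration. Real footage would normally be decoded and averaged with an image or video library.

```python
def mean_colour(frame):
    """Average (r, g, b) over all pixels of one frame.

    A frame is assumed to be a list of rows, each row a list of RGB tuples.
    """
    pixels = [px for row in frame for px in row]
    if not pixels:
        raise ValueError("frame has no pixels")
    n = len(pixels)
    return tuple(round(sum(px[c] for px in pixels) / n) for c in range(3))


def movie_barcode(frames):
    """One averaged colour per frame: the stripes of the barcode, in order."""
    return [mean_colour(frame) for frame in frames]


# Two invented 2x2 frames: one fully red, one half black and half light grey.
red_frame = [[(255, 0, 0), (255, 0, 0)], [(255, 0, 0), (255, 0, 0)]]
mixed_frame = [[(0, 0, 0), (254, 254, 254)], [(0, 0, 0), (254, 254, 254)]]

print(movie_barcode([red_frame, mixed_frame]))
```

Drawing each resulting colour as a thin vertical stripe gives a compact overview of how the colour of a film changes over time.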

Access to AV collections in digital research environments

Automatic speech recognition (ASR) transcripts and the automatic identification of objects and faces are tangible, already applied benefits of speech recognition and computer vision for improving access to AV collections provided by archives and broadcasters. The following presentations dealt with “portals,” “platforms,” or “research environments” that combine archival metadata with automatic annotations, plus other annotation facilities that support researchers in exploring and working with AV collections.

I-Media-Cities: aggregated portal of European film archival collections

The I-Media-Cities platform offers access to film archival material from and about the nine participating cities (Athens, Barcelona, Bologna, Brussels, Copenhagen, Frankfurt, Stockholm, Turin, and Vienna). The platform is the result of an H2020 project which started in 2016 and finished this year. The presentation focused on explaining the project and the functionality of the platform, which gives researchers and the general public access to these collections. A positive aspect of the platform is that the copyright status of each media item is clearly specified and used for filtering. This is useful since it gives the creators of this aggregated platform the flexibility of having an open platform while still locking certain content for viewing (in those cases, the user has to request access to the content from the provider). This approach to providing access to AV collections that are heavily restricted by copyright differs from that of the CLARIAH Media Suite, discussed below.

Finding and discovering AV content via the I-Media-Cities platform is facilitated by a combination of manually generated metadata (provided by the archives) and automatically generated metadata (mostly using shot-boundary detection and object detection algorithms that were obtained in collaboration with the Fraunhofer Institute). User annotations can be added to an individual shot or, if the user chooses to annotate a fragment, to the group of shots that compose the fragment (see image below). Those annotations can be tags, geotags, notes, links, or references. The automatic annotations (e.g., the label “single person”) are added to each shot together with the user-created annotations.
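To make the notion of shot-boundary detection more concrete, here is a minimal sketch of one classic approach: comparing grey-level histograms of consecutive frames and marking a cut where they differ strongly. This is not the algorithm used in the I-Media-Cities project, which the presentation did not detail; the frame representation (flat lists of 0-255 grey values), the function names, the number of bins, and the threshold are all assumptions chosen for illustration.

```python
def grey_histogram(frame, bins=8):
    """Normalised histogram of a frame given as a flat list of 0-255 grey values."""
    if not frame:
        raise ValueError("frame has no pixels")
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def detect_cuts(frames, threshold=0.5, bins=8):
    """Return the indices of frames that start a new shot.

    A cut is reported when half the L1 distance between the histograms of
    two consecutive frames (a value between 0 and 1) exceeds `threshold`.
    """
    cuts = []
    previous = None
    for index, frame in enumerate(frames):
        current = grey_histogram(frame, bins)
        if previous is not None:
            difference = 0.5 * sum(abs(a - b) for a, b in zip(previous, current))
            if difference > threshold:
                cuts.append(index)
        previous = current
    return cuts


# Invented frames: two dark frames, two bright frames, one dark frame.
dark = [10] * 16
bright = [240] * 16
print(detect_cuts([dark, dark, bright, bright, dark]))
```

A simple threshold like this only catches hard cuts; gradual transitions such as fades and dissolves require more elaborate methods.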

The fact that the annotations are added and displayed per shot invites a high level of granularity (see the purple and orange dots in the image below). However, one wonders whether this approach will suit the needs of researchers (and the general public) who need to annotate fragments that unfold temporally, and not the “static phenomena deduced from individual frames,” as was argued during the panel “Between Data Mining and Human Experience – Digital Approaches to Film, Television and Video Game Analysis” organized by the CinePoetics Center for advanced Film Studies. I hope that future collaborations between I-Media-Cities and other groups of researchers investigating scholarly annotations of moving images will facilitate the sharing of expertise, to the benefit of the user community of this valuable audiovisual heritage.

Image 2. Screenshot from the “I-Media-Cities” platform (July, 2019).

The project ended this year, but the consortium will take care of the sustainability of the platform, also inviting other institutions to become content providers. It was also mentioned during the presentation that user participation via crowdsourcing is envisioned, but it was not yet clear which approach will be used for user engagement and for keeping the workflows for user annotations and automatic annotations connected. The code of this platform will be made available as open source, as announced on the project’s website.

The CLARIAH Media Suite: transparent research support

The CLARIAH Media Suite, a research environment built to facilitate access to and research on important audio-visual archives in The Netherlands, was presented at the AVinDH workshop (synopsis above), and in a paper at the panel “Exploring Audiovisual Corpora in the Humanities: Methods, Infrastructure, and Software.” In the panel’s paper, Jasmijn van Gorp (also on behalf of her co-authors) introduced the project, focusing on the participation of “users” in a “co-development” approach for building a sustainable infrastructure. The paper shows examples of the approaches we used in this project to involve scholars in the process of incorporating collections and building functionalities to work with them. The Media Suite has, for the first time, provided online access to the entire audiovisual collection of The Netherlands Institute for Sound and Vision. In the context of the CLARIAH Core project, workflows are in place to progressively generate automatic transcripts from the audio signal. Within the CLARIAH Plus project, which is just starting, other automatic annotations and other audiovisual collections will be made available via the Media Suite. Incorporating these collections in a way that makes them usable for researchers requires the researchers’ constant and active participation, which in turn demands more innovative user-centered design approaches. Our presentation is available here.

Computer vision to improve access to the BBC and BFI collections

In the paper “Computer Vision Software for AV Search and Discovery” presented at the panel “Exploring Audiovisual Corpora in the Humanities: Methods, Infrastructure, and Software,” Giles Bergel, from the Visual Geometry Group (VGG) at Oxford University, introduced ongoing collaborative efforts for integrating automatic annotations in the collection management of the BBC and the British Film Institute, mostly focused on providing access to and experimenting with facial recognition across large datasets.

In the same vein as the Media Ecology Project, the presenter highlighted the need for creating an “ecosystem” (in terms of integrative data flows and collaboration between institutions) in which archival metadata and automatic annotations can improve reciprocally. The presenter also proposed a “multimodal approach to search,” which benefits from machine learning applied to exploiting the correspondences between the audio and visual content of videos. This research has resulted in a live demo to perform visual search of BBC news based on objects, persons, text, and query by example. More live demos of image search in important collections from the VGG group are available on their website!

Image 3. Screenshot from “BBC News Search” powered by Oxford’s Visual Geometry Group (July, 2019).

Boldly computational

  • The presentation “Seen by Machine: Computational Spectatorship in the BBC television archive” by Daniel Chavez Heras (also on behalf of his co-authors) showed and reflected upon the BBC’s “Made by machine” project, which used machine learning to produce a new program based on footage extracted from the BBC archives. Both the selection of the clips and their processing used machine learning approaches. The automatically generated program, in which the author sees a promising connection with contemporary aleatory music, was broadcast on BBC Four in September 2018. The presenter drew attention to the comments added by the spectators of this machine-made program. Daniel showed the negative, confused, but also sometimes enthusiastic and enlightening feedback, all of which he constructively uses to build the concept of computational spectatorship: “a way to understand how our visual regimes are increasingly mediated by machine-seers.”
  • Lauren Tilton and Taylor Arnold presented their “Distant viewing” project and toolkit for using deep learning in the analysis of images. This is a software library that enables research into moving images via the automatic detection of stylistic elements (colour, lighting), structural elements (shot boundaries), content elements (object and face identification), sound elements, and transcripts of the spoken word. Lauren and Taylor gave a tutorial at the AVinDH workshop, and a paper presentation at the panel “Exploring Audiovisual Corpora in the Humanities: Methods, Infrastructure, and Software.”
  • The paper “Early Modern Computer Vision” by Leonardo Laurence Impett shows an intriguing and exciting alternative angle on the way current computer vision is understood in artificial intelligence. He revisits historical theories of vision and early experiments. He also experiments with those theories by presenting a prototype based on Italian theories of optics, vision and visual art of the 16th century. In my view, this is a great example of how critical experimentation can be applied in humanities scholarship, by testing the interaction between foundational premises and the way systems work. This idea is aligned with Geoffrey Rockwell’s call during his talk “Zombies as tools” at the DLS workshop for more replication-like experiments with early tools.
  • “Deep Watching: Towards New Methods of Analyzing Visual Media in Cultural Studies”, by Bernhard Bermeitinger and others, discusses two examples of using computational methods: for the identification of objects, symbols and “persons, and their mimics and postures” in YouTube videos, and for the analysis of trading cards of the actress Marlene Dietrich.

Assisted AV manual and semi-automatic annotation

  • Film scholar Mark Williams and his team named his project “Media Ecology” (MEP). This name conveys the idea of how the need to train computer vision algorithms, both for curatorial and scholarly use, has created “collaborative synergies” between archives and researchers, to which annotation is a central activity. The MEP group of tools that facilitates this “ecology” in the creation and exchange between scholarly annotations and the AV media archives are: the Semantic Annotation Tool (SAT) for semantic annotation; for vocabulary sharing; and the “Machine Vision” prototype for searching automatic annotations.
  • The panel “Between Data Mining and Human Experience – Digital Approaches to Film, Television and Video Game Analysis” organized by the CinePoetics Center for advanced Film Studies at the Freie Universität Berlin gave prominent attention to the methodological aspects of annotating “temporal arts and media” (film, television, web videos, and video games) for scholarly purposes, and to the practical implementations they have been working on. The presenters challenge existing qualitative methodologies in the humanities, which may not be suitable for the study of aesthetic and compositional patterns in temporally-based media, since they focus on the “isolation of features,” while aesthetic analyses have a more integrative perspective. One of the papers (“Researching and Annotating Audiovisual Patterns – Methodological Considerations”) digs into the requirements for annotation tools that would facilitate annotation practices rooted in both film theory and qualitative research, showing their experience with the video annotation tool ADVENE. To tackle the high time investment that this fine-grained annotation of films requires (“involving several hours of work per minute of film”), the CinePoetics team has worked on developing and integrating a controlled vocabulary and ontology of film analytical concepts into the annotation tool (see the ADA ontology project, ADA public GitHub repository; and ontology in Ontoviz), and on developing semi-automatic classification of audiovisual patterns (for shot detection, colour range detection, automatic speech recognition, visual concept detection, and image search support). This systematic approach to annotation, in combination with a tool that supports layered annotation, a scholarly-based ontology, and the combination of automatic and manual annotations, results in very impressive visualizations of what the presenters call “film scores” (image below). These annotations are published as linked open data via a SPARQL endpoint. The interactive querying and visualization of the semantic annotations, plus the viewing of the semantically annotated videos, can be done via the ADA Annotation Explorer, developed with the web-based, open source hypervideo software FrameTrail.
  • The presentation by Michael Bourgatte (Institut Catholique de Paris), “Prendre en Compte le Contexte d’Usage et le Contexte Technique dans le Développement du Service d’Annotation Vidéo Celluloid”, showed the manual video annotation tool (Celluloid) that was developed within a socio-constructivist approach to support annotation as an essential activity for “active engagement” with audiovisual sources in education and research contexts. This work was inspired by Lignes de Temps and other relevant annotation tools used in film or performative analyses, proposing, however, a different approach to the display of the annotations (not as separate from the video, but integrated within it).

Image 4. “Film scores”: different levels of annotations, screenshot from ADVENE, taken from the presenters’ paper (July, 2019).

Doing history with recorded interviews

Oral historians have relied in the past, to a great extent, on the analysis of manual transcriptions of the audio or video recordings of the interviews they conduct as part of their investigations. But AV technologies are bringing new opportunities for doing history with digital AV sources. At DH2019 oral historians were well represented, with two workshops and a presentation:

  • The workshop “A transcription portal for oral history research and beyond”, organized by Henk van den Heuvel and coauthors, introduced the participants to the recently launched transcription chain prototype, the “T-Chain portal.” This portal, built together with oral historians, linguists and researchers from other disciplines interested in doing research with and about automatic speech recognition, allows researchers to upload their audio-recorded interviews, use the open source automatic speech recognition software available for the language of the interview (English, Dutch, German, Italian), and correct and export the automatically generated transcript.
  • The workshop “Oral history: A Multidisciplinary Approach To The Use Of Technology In Research: The Case Of Interview Data”, organized by Arjan van Hessen and coauthors, presented the work done in a previous series of workshops on this topic, supported by CLARIN. These workshops had a focus on the “multidisciplinary potential of interview data (history, oral and written language, audio-visual communication),” and on seeking synergy between the different methods and tools used in different disciplines to work with AV data. Along those lines, the DH workshop focused on sharing experiences about the organization and conclusions of this series of workshops, and on developing the skills of participants in working with digital tools to study interviews.
  • The presentation by Norah Karrouche, “Still Waters Run Deep. Including Minority Voices in the Oral History Archive Through Digital Practice”, critically reflects upon three aspects: 1) the content/focus of the oral history projects conducted in the past two decades in The Netherlands, which have given priority to WWII memories, excluding other underrepresented topics and groups; 2) the lack of integration of digital methods in oral history, due to a neglect by other disciplines, but also within the discipline itself, about the validity and usefulness of oral history and digital methods, which are only marginally incorporated at universities in The Netherlands; 3) the difficulties and lack of awareness among oral historians about the different regulations that could make more oral history collections open. Norah combines these reflections into a proposal that seeks cooperation between CLARIAH (Common Lab Research Infrastructure for the Arts and Humanities), a large-scale digital humanities research infrastructure project in the Netherlands, and the small community archive Verhalenhuis Belvédère in Rotterdam. The aim is to analyze and improve the workflows that could make community archives more open to researchers, and to explore how these archives can benefit from existing tools, and tools in development, that facilitate research with AV data. There will be a workshop after DH where Norah will put all these ideas into practice.

Other related presentations

I have listed and commented on some of the most explicitly AV-related contributions at DH2019, but there were several other papers which indirectly dealt with the topic.

Concluding remarks

Image 5. Tweet by Martijn Kleppe (July 12, 2019).

This post presented a comprehensive (but certainly incomplete!) summary of the main AV-related contributions at the DH2019 conference. The number and breadth of the projects listed here, many of which have adopted computational methods in either a basic or more adventurous way, is an explicit sign of “a turn toward the visual in digital humanities research,” about which historians Melvin Wevers and Thomas Smits, innovators in doing computationally-based historical research on images of digitized Dutch newspapers, wrote a significant contribution (Wevers and Smits, 2019).

An obvious conclusion from this overview is the great potential for “collaborative synergies” (as film scholar Mark Williams likes to emphasize) between the discussed AV-based projects in scholarly research and digital curation. As most of these projects have the annotation scholarly primitive as a basis, more tools will be developed, with different “ergonomics” adapted to their user groups (e.g., to the needs of media, television, and performing arts scholars, oral historians, linguists, or curators). To achieve stronger synergies, sharing these annotations will become more urgent, which calls for an interoperability framework for sharing and reusing scholarly annotations. This framework doesn't necessarily have to be based only on sharing the vocabularies used for annotation (since these are discipline-dependent, as we saw in the ADA ontology project described above). A more discipline-independent approach is to use the W3C annotation model, which aims to “enable annotations to be shared and reused across different hardware and software platforms.” With the aim of exploring this further, we have promoted, within CLARIAH, the creation of an expert interest group of developers of AV-annotation tools, called VAINT, which stands for Video (time-based media) Annotation Interoperability iNterest Group. This group investigates how to adapt the generic W3C annotation model for sharing scholarly annotations, with a focus on time-based media, also looking for synergies with the IIIF interoperability framework.

Looking forward to seeing more humanities research using the wealth of increasingly digitally available sound and audio-visual archives at DH2020 in Ottawa!