By: Liliana Melgar (Utrecht University / The Netherlands Institute for Sound and Vision)

The largest Digital Humanities conference thus far (DH2019) took place in Utrecht from July 8 to 12, 2019. More than one thousand (digital) humanists from all over the world participated in the annual gathering of scholars, curators, librarians, information and computer scientists, and publishers, among others, who are incorporating, experimenting with, and innovating in digital methods for doing or supporting scholarly work.

The use of digital sources and methods in the humanities (called “digital humanities”) has clear foundations and development in textual domains (such as literary studies or historical research). But, for some years already, the increasing availability of audio-visual materials has started to draw scholars’ attention to the potential of those sources within fields traditionally dominated by text, and to the possibilities of using computational methods in visually based domains, such as (digital) art history.

In this two-part blog post, I summarize most of the AV-related workshops, panels, and presentations at DH2019, and share some impressions about the sessions that I attended and the papers that I read after the conference. In Part 1, I introduce the central workshop (AVinDH). Then, in Part 2, I group the talks into different themes. The entire DH2019 program can be accessed here, and the book of abstracts here (I provide the links to each abstract).

AVinDH SIG Workshop

(see this part of the blog post also at The Netherlands Institute for Sound and Vision blog)

The awareness of the increasing relevance of AV sources for scholarly work led to the idea of founding a special interest group (AVinDH SIG) during the DH Conference in Lausanne in 2014. The group has the aim “to facilitate communication and interaction between researchers from various disciplines including domains such as media studies, history, oral history studies, visual culture studies, social signal processing, archeology, anthropology, linguistics.”

On Monday 8 July the fifth AVinDH workshop organized by this interest group took place at DH2019. The workshop, chaired this time by Lauren Tilton (University of Richmond) and Jasmijn van Gorp (Utrecht University), had around 20 participants from domains such as film, television and media studies, cultural history, stylometry, spatial humanities, information and computer science, arts, design and linguistics.

Image 1. AVinDH SIG workshop at DH2019, from left to right: John Bell, Dimakatso Mathe, Christian Olesen, Melvin Wevers, Liliana Melgar, Susan Aasman, Mark Williams, Lauren Tilton, Carol Chiodo, David Wrisley, Julia Noordegraaf, Taylor Arnold, Joanna Byszuk, Jasmijn van Gorp, Manuel Burghardt, Nanne van Noord, and Daniel Chavez Heras.

The AVinDH workshop included parallel tutorials and short “lightning” talks. The slides and materials are linked from the workshop’s web page. These were the tutorials:

  • A tutorial by Julia Noordegraaf, Jasmijn van Gorp and myself about using the CLARIAH Media Suite, showing the potential of using the automatic speech transcripts (ASR) that are progressively being added to The Netherlands Institute for Sound and Vision audiovisual collection, and made available to researchers via the Media Suite. The tutorial showed, from a source and tool criticism perspective, how the Media Suite makes it possible to search AV content using the ASR transcripts. Participants were invited to reflect (using the tools provided for metadata inspection) on the consequences of doing research with automatic annotations which are constantly growing (but often “incomplete”), and cannot be 100% accurate.
  • A tutorial by Bernhard Rieder and Thomas Poell (University of Amsterdam) exemplified how to do research about online social media activity based on material produced by public broadcasters. They explained how they extracted and used YouTube data (with one of the tools offered by the Digital Methods Initiative at the University of Amsterdam) in combination with broadcast programs from The Netherlands Institute for Sound and Vision’s archive, which is made available to researchers via the CLARIAH Media Suite. Their project on the European refugee crisis (from 2013) consisted of finding YouTube clips from public broadcasters by matching the automatic speech transcripts, and of analysing the related online social media activity.
  • A tutorial by Lauren Tilton and Taylor Arnold on using the “Distant Viewing” Toolkit for the analysis of images using deep learning. The tutorial offered the participants the opportunity to learn the basics of image processing in Python and the concepts of deep learning for images, and to apply the Distant Viewing Toolkit to moving images.
  • An introduction to the Media Ecology Project (MEP) and a practical hands-on tutorial with the Semantic Annotation Tool (SAT) were given by John Bell and Mark Williams. The participants learned how to easily embed the SAT annotation client, the “waldorf.js” plugin (“a drop-in module that facilitates the creation and sharing of time-based media annotations on the Web”), in any website that streams video content. These annotations can be stored and created collaboratively thanks to the SAT “statler” server.
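
The source-criticism point of the first tutorial (searching AV content through ASR transcripts that are still growing and never fully accurate) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Segment` fields, the confidence scores, and the sample collection are invented and do not reflect the Media Suite's actual data model or API. The sketch shows a keyword search that is always read together with a coverage figure, since an empty result may mean "not transcribed yet" rather than "not mentioned".

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float       # start time in seconds
    end: float         # end time in seconds
    text: str          # text recognized by the ASR engine
    confidence: float  # hypothetical ASR confidence between 0 and 1


# Hypothetical collection: program id -> ASR segments, or None if the
# program has not been transcribed yet.
collection = {
    "program-a": [
        Segment(0.0, 4.2, "the refugee crisis in europe", 0.91),
        Segment(4.2, 9.0, "weather for tomorrow", 0.55),
    ],
    "program-b": None,
    "program-c": [Segment(12.0, 15.5, "refugees arrive by boat", 0.48)],
}


def search(collection, term, min_confidence=0.0):
    """Return (program id, start, end) for every segment containing the term."""
    hits = []
    for program_id, segments in collection.items():
        if segments is None:
            continue  # no transcript yet: silence here is not evidence
        for seg in segments:
            if term.lower() in seg.text.lower() and seg.confidence >= min_confidence:
                hits.append((program_id, seg.start, seg.end))
    return hits


def coverage(collection):
    """Fraction of programs that have a transcript at all."""
    transcribed = sum(1 for segments in collection.values() if segments is not None)
    return transcribed / len(collection)


print(search(collection, "refugee"))
print(search(collection, "refugee", min_confidence=0.5))
print(f"transcript coverage: {coverage(collection):.0%}")
```

Raising `min_confidence` trades recall for precision, which is one concrete way of making the "cannot be 100% accurate" caveat visible in a search result.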

 

The AVinDH workshop also included “lightning talks” in which the participants presented their ongoing AV-related research:

  • Manuel Burghardt, from the Computational Humanities Group at Leipzig University, introduced the “Scalable MovieBarcodes,” an exploratory interface for the analysis of movies.
  • Nanne van Noord, from the University of Amsterdam, described the “Sensory Moving Image Archive” (SEMIA) project, and how they used computer vision to analyse non-verbal syntactic features in moving image material.
  • Joanna Byszuk, a stylometrist from the Institute of Polish Language, introduced her work on Distant reading television, “a stylometry of textual layer of television shows and a few problems related to its representativeness, features, and authorship.”
  • Susan Aasman, media scholar from Groningen University, presented work-in-progress of the research project "Intimate histories; a web-archaeological approach to the early history of YouTube."
  • Melvin Wevers, postdoctoral researcher in the Digital Humanities Lab at the KNAW Humanities Cluster, explained why “drawing boxes” is difficult, showing the challenges of using computer vision for the identification of people’s gender, to be used in the subsequent study of gender representation in newspaper advertisements.
  • Liliana Melgar, from Utrecht University and The Netherlands Institute for Sound and Vision (NISV), also on behalf of Mari Wigham, data scientist from NISV, both working for the CLARIAH project, argued that the CLARIAH Media Suite’s graphical user interface (GUI) should work in combination with Jupyter Notebooks, facilitating the analysis of audiovisual data in a flexible and transparent way.

Access to AV collections in digital research environments

Automatic speech recognition (ASR) transcripts and the automatic identification of objects and faces are tangible, already applied benefits of speech technology and computer vision for improving access to the AV collections provided by archives and broadcasters. The following presentations dealt with “portals,” “platforms,” or “research environments” that combine archival metadata with automatic annotations, plus other annotation facilities, to support researchers in exploring and working with AV collections.

I-Media-Cities: aggregated portal of European film archival collections

The I-Media-Cities platform offers access to film archival material from and about the nine participating cities (Athens, Barcelona, Bologna, Brussels, Copenhagen, Frankfurt, Stockholm, Turin, and Vienna). The platform is the result of an H2020 project which started in 2016 and finished this year. The presentation focused on explaining the project and the functionality of the platform, which gives researchers and the general public access to these collections. A positive aspect of the platform is that the copyright status of each media item is clearly specified and used for filtering. This gives the creators of this aggregated platform the flexibility of keeping the platform open while still locking certain content for viewing (in those cases, the user has to request access from the provider). This approach to providing access to AV collections that are heavily restricted by copyright differs from that of the CLARIAH Media Suite, mentioned below.

Finding and discovering AV content via the I-Media-Cities platform is facilitated by a combination of manually generated metadata (provided by the archives) and automatically generated metadata (mostly using shot-boundary detection and object detection algorithms that were obtained in collaboration with the Fraunhofer Institute). User annotations can be added to an individual shot or, if the user chooses to annotate a fragment, to the group of shots that compose the fragment (see image below). Those annotations can be tags, geotags, notes, links, or references. The automatic annotations (e.g., the label “single person”) are added to each shot together with the user-created annotations.
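
The shot-based annotation model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Shot` class, the field names, and the overlap rule are hypothetical and are not taken from the I-Media-Cities code base. The sketch shows how an annotation made on a time fragment ends up attached to every shot that the fragment overlaps, next to the automatic labels already stored on those shots.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    index: int
    start: float  # seconds, inclusive
    end: float    # seconds, exclusive
    annotations: list = field(default_factory=list)


def shots_in_fragment(shots, frag_start, frag_end):
    """Return the shots that overlap the interval [frag_start, frag_end)."""
    return [s for s in shots if s.start < frag_end and s.end > frag_start]


def annotate_fragment(shots, frag_start, frag_end, annotation):
    """Attach one annotation to every shot the fragment overlaps."""
    touched = shots_in_fragment(shots, frag_start, frag_end)
    for shot in touched:
        shot.annotations.append(annotation)
    return [shot.index for shot in touched]


# Hypothetical output of shot-boundary detection for one film.
shots = [Shot(0, 0.0, 5.0), Shot(1, 5.0, 12.5), Shot(2, 12.5, 20.0)]

# An automatic label and a user tag live side by side on the same shot.
shots[0].annotations.append({"source": "automatic", "label": "single person"})
touched = annotate_fragment(shots, 4.0, 13.0, {"source": "user", "tag": "tram"})

print(touched)
print(shots[0].annotations)
```

In this sketch a fragment has no identity of its own: it is dissolved into its shots, which is one way to see why phenomena that unfold over time may be harder to express in a per-shot model.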

The fact that the annotations are added and displayed per shot invites a high level of granularity (see the purple and orange dots in the image below). However, one wonders whether this approach will suit the needs of researchers (and the general public) who need to annotate fragments that unfold temporally, and not the “static phenomena deduced from individual frames,” as was argued during the panel “Between Data Mining and Human Experience – Digital Approaches to Film, Television and Video Game Analysis” organized by the CinePoetics Center for Advanced Film Studies. I hope that future collaborations between I-Media-Cities and other groups of researchers investigating scholarly annotations of moving images will facilitate the sharing of expertise, to the benefit of the user community of this valuable audiovisual heritage.

Image 2. Screenshot from the “I-Media-Cities” platform (July 2019).

The project ended this year, but the consortium will take care of the sustainability of the platform, also inviting other institutions to become content providers. It was also mentioned during the presentation that user participation via crowdsourcing is envisioned, but it was not yet clear which approach will be used for user engagement and for keeping the connection between the workflows for user and automatic annotations. The code of this platform will be made available as open source, as announced on the project’s website.

The CLARIAH Media Suite: transparent research support

The CLARIAH Media Suite, a research environment built to facilitate access to and research with important audio-visual archives in The Netherlands, was presented at the AVinDH workshop (synopsis above), and in a paper at the panel “Exploring Audiovisual Corpora in the Humanities: Methods, Infrastructure, and Software.” In the panel’s paper, Jasmijn van Gorp (also on behalf of her co-authors) introduced the project, focusing on the participation of “users” in a “co-development” approach for building a sustainable infrastructure. The paper shows examples that we used in this project to involve scholars in the process of incorporating collections and building functionalities to work with them. The Media Suite has, for the first time, provided online access to the entire audiovisual collection of The Netherlands Institute for Sound and Vision. In the context of the CLARIAH Core project, workflows are in place to progressively generate automatic transcripts from the audio signal. Within the CLARIAH Plus project, which is just starting, other automatic annotations and other audiovisual collections will be made available via the Media Suite. To incorporate these collections in a way that makes them usable for researchers, the constant and active participation of those researchers is required, which also demands more innovative user-centered design approaches. Our presentation is available here.

Computer vision to improve access to the BBC and BFI collections

In the paper “Computer Vision Software for AV Search and Discovery” presented at the panel “Exploring Audiovisual Corpora in the Humanities: Methods, Infrastructure, and Software,” Giles Bergel, from the Visual Geometry Group (VGG) at Oxford University, introduced ongoing collaborative efforts for integrating automatic annotations in the collection management of the BBC and the British Film Institute, mostly focused on providing access to and experimenting with facial recognition across large datasets.

In the same vein as the Media Ecology Project, the presenter highlighted the need for creating an “ecosystem” (in terms of integrative data flows and collaboration between institutions) in which archival metadata and automatic annotations can improve each other. The presenter also proposed a “multimodal approach to search,” which benefits from machine learning that exploits the correspondences between the audio and visual content of videos. This research has resulted in a live demo for visual search of BBC news based on objects, persons, text, and query by example. More live demos of image search in important collections from the VGG group are available on their website!

Image 3. Screenshot from “BBC News Search” powered by Oxford’s Visual Geometry Group (July 2019).

Boldly computational

  • The presentation “Seen by Machine: Computational Spectatorship in the BBC television archive” by Daniel Chavez Heras (also on behalf of his co-authors) showed and reflected upon the BBC’s “Made by machine” project, which used machine learning to produce a new program based on footage extracted from the BBC archives. Both the selection of the clips and their processing used machine learning approaches. The automatically generated program, which the author sees as having a promising connection with contemporary aleatory music, was broadcast on BBC Four in September 2018. The presenter drew attention to the comments added by the spectators of this machine-made program. Daniel showed the negative, confused, but also sometimes enthusiastic and enlightening feedback, all of which he uses constructively to build the concept of computational spectatorship: “a way to understand how our visual regimes are increasingly mediated by machine-seers.”
  • Lauren Tilton and Taylor Arnold presented their “Distant Viewing” project and toolkit for using deep learning in the analysis of images. This is a software library that enables research into moving images via the automatic detection of stylistic elements (colour, lighting), structural elements (shot boundaries), content elements (object and face identification), sound elements, and transcripts of the spoken word. Lauren and Taylor gave a tutorial at the AVinDH workshop, and a paper presentation at the panel “Exploring Audiovisual Corpora in the Humanities: Methods, Infrastructure, and Software.”
  • The paper “Early Modern Computer Vision” by Leonardo Laurence Impett shows an intriguing and exciting alternative angle on the way computer vision is currently understood in artificial intelligence. He revisits historical theories of vision and early experiments. He also experiments with those theories by presenting a prototype based on Italian theories of optics, vision and visual art of the 16th century. In my view, this is a great example of how critical experimentation can be applied in humanities scholarship, by testing the interaction between foundational premises and the way systems work. This idea is aligned with Geoffrey Rockwell’s call, during his talk “Zombies as tools” at the DLS workshop, for more replication-like experiments with early tools.
  • “Deep Watching: Towards New Methods of Analyzing Visual Media in Cultural Studies”, by Bernhard Bermeitinger and others, discusses two examples of using computational methods: the identification of objects, symbols and “persons, and their mimics and postures” in YouTube videos, and the analysis of trading cards of the actress Marlene Dietrich.

Assisted AV manual and semi-automatic annotation

  • Film scholar Mark Williams and his team named their project “Media Ecology” (MEP). This name conveys the idea that the need to train computer vision algorithms, both for curatorial and scholarly use, has created “collaborative synergies” between archives and researchers, in which annotation is a central activity. The MEP tools that facilitate this “ecology” of creation and exchange between scholarly annotations and the AV media archives are: the Semantic Annotation Tool (SAT) for semantic annotation; Onomy.org for vocabulary sharing; and the “Machine Vision” prototype for searching automatic annotations.
  • The panel “Between Data Mining and Human Experience – Digital Approaches to Film, Television and Video Game Analysis”, organized by the CinePoetics Center for Advanced Film Studies at the Freie Universität Berlin, gave prominent attention to the methodological aspects of annotating “temporal arts and media” (film, television, web videos, and video games) for scholarly purposes, and to the practical implementations the team has been working on. The presenters challenge existing qualitative methodologies in the humanities, which may not be suitable for the study of aesthetic and compositional patterns in temporally based media, since they focus on the “isolation of features,” while aesthetic analyses have a more integrative perspective. One of the papers (“Researching and Annotating Audiovisual Patterns – Methodological Considerations”) digs into the requirements for annotation tools that would facilitate annotation practices rooted in both film theory and qualitative research, showing the team’s experience with the video annotation tool ADVENE. To tackle the high time investment that this fine-grained annotation of films requires (“involving several hours of work per minute of film”), the CinePoetics team has worked on developing and integrating a controlled vocabulary and ontology of film analytical concepts into the annotation tool (see the ADA ontology project, the ADA public GitHub repository, and the ontology in Ontoviz), and on developing semi-automatic classification of audiovisual patterns (for shot detection, colour range detection, automatic speech recognition, visual concept detection, and image search support). This systematic approach to annotation, in combination with a tool that supports layered annotation, a scholarly-based ontology, and the combination of automatic and manual annotations, results in very impressive visualizations of what the presenters call “film scores” (image below). These annotations are published as linked open data via a SPARQL endpoint. The interactive querying and visualization of the semantic annotations, plus the viewing of the semantically annotated videos, can be done via the ADA Annotation Explorer, developed with the web-based, open source hypervideo software FrameTrail.
  • The presentation by Michael Bourgatte (Institut Catholique de Paris), “Prendre en Compte le Contexte d’Usage et le Contexte Technique dans le Développement du Service d’Annotation Vidéo Celluloid”, showed the manual video annotation tool Celluloid, which was developed within a socio-constructivist approach to support annotation as an essential activity for “active engagement” with audiovisual sources in education and research contexts. This work was inspired by Lignes de Temps and other relevant annotation tools used in film or performative analyses, proposing, however, a different approach to the display of the annotations (not separate from the video, but integrated within it).

Image 4. “Film scores”: different levels of annotations, screenshot from ADVENE (taken from the paper published at: https://dev.clariah.nl/files/dh2019/boa/0537.html) (July 2019).

Doing history with recorded interviews

Oral historians have, in the past, relied to a great extent on analysing manual transcriptions of the audio or video recordings of the interviews they conduct as part of their investigations. But AV technologies are bringing new opportunities for doing history with digital AV sources. At DH2019 oral historians were well represented, with two workshops and a presentation:

  • The workshop “A transcription portal for oral history research and beyond”, organized by Henk van den Heuvel and coauthors, introduced the participants to the recently launched transcription chain prototype, the “T-Chain portal.” This portal, built together with oral historians, linguists and researchers from other disciplines interested in doing research with and about automatic speech recognition, allows researchers to upload their audio-recorded interviews, use the open source automatic speech recognition software available for the language of the interview (English, Dutch, German, Italian), and then correct and export the automatically generated transcript.
  • The workshop “Oral history: A Multidisciplinary Approach To The Use Of Technology In Research: The Case Of Interview Data”, organized by Arjan van Hessen and coauthors, presented the work done in a previous series of workshops on this topic, supported by CLARIN (https://oralhistory.eu/workshops). These workshops had a focus on the “multidisciplinary potential of interview data (history, oral and written language, audio-visual communication),” and on seeking synergy between the different methods and tools used in different disciplines to work with AV data. Along those lines, the DH workshop also focused on sharing experiences about the organization and conclusions of this series of workshops, and on developing the skills of participants in working with digital tools to study interviews.
  • The presentation by Norah Karrouche, “Still Waters Run Deep. Including Minority Voices in the Oral History Archive Through Digital Practice”, critically reflects upon three aspects: 1) the content/focus of the oral history projects conducted in the past two decades in The Netherlands, which have given priority to WWII memories, excluding other underrepresented topics and groups; 2) the lack of integration of digital methods in oral history, due to doubts in other disciplines, but also within the discipline itself, about the validity and usefulness of oral history and digital methods, which are only marginally incorporated at universities in The Netherlands; 3) the difficulties and lack of awareness among oral historians regarding the different regulations that could make more oral history collections open. Norah combines these reflections into a proposal that seeks cooperation between CLARIAH (Common Lab Research Infrastructure for the Arts and Humanities), a large-scale digital humanities research infrastructure project in the Netherlands, and the small community archive Verhalenhuis Belvédère in Rotterdam. The aim is to analyze and improve the workflows that could make community archives more open to researchers, and to examine how these archives can benefit from tools, both existing and in development, that facilitate research with AV data. There will be a workshop after DH where Norah will put all these ideas into practice.

Other related presentations

I have listed and commented on some of the most explicitly AV-related contributions at DH2019, but there were several other papers which indirectly dealt with the topic, for example:

Concluding remarks

Image 5. Tweet by Martijn Kleppe (July 12, 2019), https://twitter.com/MartijnKleppe/status/1149659388988145665

This post presented a comprehensive (but certainly incomplete!) summary of the main AV-related contributions at the DH2019 conference. The number and breadth of the projects listed here, many of which have adopted computational methods in either a basic or a more adventurous way, is an explicit sign of “a turn toward the visual in digital humanities research,” about which historians Melvin Wevers and Thomas Smits, innovators in computationally based historical research on images in digitized Dutch newspapers, wrote a significant contribution (Wevers and Smits, 2019).

An obvious conclusion from this overview is the great potential for “collaborative synergies” (as film scholar Mark Williams likes to emphasize) between the discussed AV-based projects in scholarly research and digital curation. As most of these projects have the annotation scholarly primitive as a basis, more tools will be developed, with different “ergonomics” adapted to their user groups (e.g., to the needs of media, television, and performing arts scholars, oral historians, linguists, or curators). To achieve stronger synergies, sharing these annotations will become more urgent, which calls for an interoperability framework for sharing and reusing scholarly annotations. This framework doesn't necessarily have to be based only on sharing the vocabularies used for annotation (since these are discipline-dependent, as we saw in the ADA ontology project described above). A more discipline-independent approach is to use the W3C annotation model, which would make it possible to “enable annotations to be shared and reused across different hardware and software platforms.” With the aim of exploring this further, we have promoted, within CLARIAH, the creation of an expert interest group of developers of AV-annotation tools, called VAINT, which stands for Video (time-based media) Annotation Interoperability iNterest Group. This group investigates how to adapt the generic W3C annotation model for sharing scholarly annotations, with a focus on time-based media, also looking for synergies with the IIIF interoperability framework.
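
To make the interoperability idea more concrete, here is a minimal sketch of what a time-based annotation could look like when expressed in the W3C Web Annotation Data Model. The JSON-LD context, the `FragmentSelector` type and the `conformsTo` value come from the W3C model and the Media Fragments URI syntax (`t=start,end`); the helper function, the `example.org` identifiers and the tag are hypothetical and only serve the illustration. It does not represent the output of any of the tools discussed in this post, nor a VAINT recommendation.

```python
import json


def make_video_annotation(anno_id, video_url, start, end, tag):
    """Build a Web Annotation that tags a temporal fragment of a video."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "motivation": "tagging",
        "body": {"type": "TextualBody", "value": tag, "purpose": "tagging"},
        "target": {
            "source": video_url,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                # Temporal media fragment in seconds, e.g. "t=30,60.5"
                "value": f"t={start:g},{end:g}",
            },
        },
    }


# Hypothetical identifiers, used only for this example.
annotation = make_video_annotation(
    "https://example.org/annotations/1",
    "https://example.org/media/film.mp4",
    30,
    60.5,
    "tram",
)

print(json.dumps(annotation, indent=2))
```

Because the discipline-specific part (here the tag "tram") sits in the body while the temporal anchoring sits in the target, two tools with different vocabularies could still agree on which stretch of video an annotation refers to.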

Looking forward to seeing more humanities research using the wealth of increasingly digitally available sound and audio-visual archives at DH2020 in Ottawa!

Best paper award at DH Benelux 2019 for paper on temporal exploration of audiovisual sources

During the Digital Humanities Benelux 2019 conference in Liège, a paper created in the context of CLARIAH received the best paper award. This paper looked at novel and visual ways to support scholars' interpretation of audiovisual sources through temporal content exploration.

From 11 to 13 September 2019, a large number of Digital Humanities (DH) researchers from Belgium, Luxembourg and the Netherlands came together in Liège for the annual DH Benelux conference. On the basis of peer reviews conducted for DH Benelux 2019, five papers were nominated for the best paper award. According to the jury, two papers stood "head and shoulders above the rest", and therefore two best paper awards were given. One was bestowed upon Gerben Zaagsma and his paper Digital History and the Politics of Digitization. The other best paper award went to Hugo Huurdeman, Liliana Melgar, Roeland Ordelman and Julia Noordegraaf, for a paper created in the context of CLARIAH's Media Studies work package. This paper was entitled Supporting the Interpretation of Enriched Audiovisual Sources through Temporal Content Exploration.

The paper by Huurdeman et al. describes findings of the ReVI project, a pilot looking at enhancing the Resource Viewer of the CLARIAH Media Suite, where audiovisual materials can be played. Specifically, the ReVI project looked at optimal ways "to support the exploration of different types of content metadata of audiovisual sources, such as segment information or automatic transcripts." During the project, various design thinking sessions were conducted, and a prototype including temporal content visualizations of audiovisual materials was created and evaluated in a user study.

The findings of the user study showed a clear value of temporal visualizations and advanced annotation features for research purposes, as well as the continued importance of a data and tool criticism approach. New content exploration tools can benefit scholars doing research with audiovisual sources, for instance in media studies, oral history, film studies, and other disciplines which are increasingly using audiovisual media. The findings documented in the DH Benelux 2019 paper may serve as an inspiration for improving AV-media-based research tools. Concretely, they will also inform the further enhancement of the Resource Viewer of the CLARIAH Media Suite.

The conference presentation is available on SlideShare, and the paper abstract via the DH Benelux conference website.

Blog post written by Jan Odijk (Utrecht University, CLARIAH NL)

The original blog post was published on the CLARIN ERIC website.

On 23 and 24 May the CLARIN ParlaFormat workshop was held in Amersfoort, the Netherlands. This workshop was organized by the CLARIN Interoperability Committee, a subcommittee of CLARIN’s National Coordinators’ Forum.

Participants at the RCE in Amersfoort

There were 25 participants from 13 different countries, as well as participants from the CLARIN Interoperability Committee and from the CLARIN ERIC Board.

The goal of the workshop was to present an outline of a standard format (proposed name: parla-CLARIN, a subset of TEI) for parliamentary data to the research community, to assess the support for it, and to identify potential or real problems for its development and wide adoption. This proposal was prepared and presented by Tomaž Erjavec and Andrej Pančur (from CLARIN Slovenia).

The participants presented the formats they currently work with, indicated which aspects of these formats are important for them, and inquired whether these are covered by the new proposal. There was a very good and constructive atmosphere during the whole workshop, active contributions by all participants and lively discussions. In particular, there were extensive discussions on the existing standard Akoma Ntoso, which is in use in various parliaments, and what the relation should be between it and the newly proposed format.

Joint dinner at Sally's Indonesian Kitchen & Restaurant in Amersfoort

After the presentations by the various participants, Tomaž Erjavec and Andrej Pančur responded to some of the questions about and criticisms of parla-CLARIN, and explained how they will address these in the coming period.

The parla-CLARIN format will now be further developed. Information about it can be found, and contributions to it can be made, via GitHub. All participants will upload samples of their data there so that these can be taken into consideration during the further development of parla-CLARIN.

In about three months a revised version of parla-CLARIN will be made public, and we plan to organize a follow-up workshop with a shared task: all participants will convert their data into the parla-CLARIN format, report on problems encountered and share conversion scripts. If that workshop is successful, we plan to organize another follow-up workshop in which participants will address research questions that cover parliaments from multiple countries, which is then possible because of the uniform format of the parliamentary data.
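
As an impression of what such a conversion script might involve, here is a minimal sketch that turns a simple in-memory record of speeches into TEI-style XML. It rests on several assumptions: `u` (utterance), `seg` (segment) and the `who` attribute are standard TEI elements and attributes, but whether and how the final parla-CLARIN format uses them is not specified in this post; the input structure, the speaker identifiers and the `debateSection` value are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def speeches_to_tei(speeches):
    """Convert a list of speech records into a TEI <div> of utterances."""
    # "debateSection" is a hypothetical type value chosen for this sketch.
    div = ET.Element(f"{{{TEI_NS}}}div", {"type": "debateSection"})
    for speech in speeches:
        # One <u> per speech, pointing at the speaker via @who.
        u = ET.SubElement(div, f"{{{TEI_NS}}}u", {"who": "#" + speech["speaker_id"]})
        for paragraph in speech["paragraphs"]:
            seg = ET.SubElement(u, f"{{{TEI_NS}}}seg")
            seg.text = paragraph
    return div


# Hypothetical source data in a project-specific format.
speeches = [
    {"speaker_id": "speaker1", "paragraphs": ["Thank you, chair.", "I have one question."]},
    {"speaker_id": "speaker2", "paragraphs": ["The answer is yes."]},
]

tei_div = speeches_to_tei(speeches)
xml_text = ET.tostring(tei_div, encoding="unicode")
print(xml_text)
```

Once every participant's data passes through a script of this kind, the same query (for example, "all utterances by a given speaker") can be run across parliaments without knowing each source format.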

The agenda and the presentations are available on the ParlaFormat event page.

The Rotterdam Exchange Format Initiative (REFI) launches standard for sharing qualitative data across qualitative data analysis software.

By: Liliana Melgar and Marijn Koolen (CLARIAH project)

The Rotterdam Exchange Format Initiative (REFI) consists of a group of software developers and expert qualitative researchers who decided to join efforts in creating a standard for the exchange of data between qualitative data analysis software packages, also called CAQDAS or QDAS.

QDA software packages are designed to facilitate qualitative data analysis. This type of software has existed for more than thirty years (Silver and Patashnick, 2011). According to SoSciSo, an inventory of software used in social science research, there may be more than thirty packages of this type on the market. This makes it difficult for qualitative researchers to choose a package for their research, and even more difficult to move their data out of or across these packages.

Representing CLARIAH, we attended the launching event of the project exchange format produced by the REFI group, and joined the discussions about the implications and next steps.

The REFI initiative and standard

The REFI initiative originated with the aim of solving the difficulties in exchanging data between QDA software packages. As Fred van Blommestein explains, the main reasons to facilitate exchange were to make it possible for users to exchange data with colleagues, to switch to another software package rather than being locked in (and thus to benefit from the best features of each package), and to verify results by comparing them between packages. An extra reason for creating an exchange format, which was extensively discussed during the launching event, is research data archiving.

The idea to facilitate data exchange between QDA packages began during the KWALON conference in 2010. KWALON is an independent organization of researchers and lecturers at universities, colleges, research agencies and other organizations that deal with the methodology of qualitative social science research. In 2010, the so-called “KWALON experiment” was the first attempt to identify the issues in exchanging qualitative data between these applications. The KWALON Experiment consisted of five developers of Qualitative Data Analysis (QDA) software, all analysing the same dataset regarding the financial crisis in the time period 2008-2009, provided by the conference organisers. Each developer used their own software for the analysis (an article about this experiment was published in the KWALON journal, FQS, “Forum: Qualitative Social Research”, in 2011).

During the second KWALON conference, which took place in Rotterdam in 2016, Jeanine Evers, an active member of KWALON since 1995, asked the developers of the QDA packages if they were willing to work on an exchange format. The REFI group was then created and started working right after this conference. Developers from ATLAS.ti, F4 analyse, NVivo, QDA miner, Quirkos, and Transana have been actively working on the standard, with some participation by developers from Dedoose and MAXQDA. The REFI group is coordinated by Fred van Blommestein, Jeanine Evers, Yves Marcoux, Elias Rizkallah, and Christina Silver (see photo).

The REFI initiative has produced two standards:

  • The first product was a “codebook exchange” format, launched in Montreal in March 2018. This format allows users of QDA packages to export their codebooks and import them into any of the programs that implement the format (more about codebooks and the list of software packages which are compatible is at the REFI website).
  • The second product, launched on March 18, 2019 in Rotterdam (see photo with the proud group), is the “project exchange” format, which facilitates the exporting and importing of the main components of a research project done by a researcher with one of the participating software packages. As explained on the REFI website, those components include, among others: the source documents that are analyzed, the segments in those documents that researchers have identified and annotated, the codes and annotations they have assigned to these segments, memos with analytical notes, the links between codes, segments or memos, the cases, the sets/groups of entities, the visual representations of linked entities in the project, and user information.
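
To make the list of components more concrete, here is a minimal sketch in Python of what exporting and re-importing a small project could look like. Everything in it is an assumption made for illustration: the class and field names are hypothetical, and JSON is used only because it is easy to read. The actual REFI standard defines its own file format, which this sketch does not reproduce, and only a few of the components listed above (sources, segments, codes, memos) are modelled.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical, simplified data model. The names below are illustrative
# and are NOT taken from the REFI specification.


@dataclass
class Source:
    id: str
    text: str  # the source document that is analysed


@dataclass
class Code:
    id: str
    name: str


@dataclass
class Segment:
    id: str
    source_id: str
    start: int  # character offset where the segment begins
    end: int    # character offset where the segment ends (exclusive)
    code_ids: list = field(default_factory=list)  # codes assigned to it


@dataclass
class Memo:
    id: str
    text: str
    about_id: str  # id of the source, segment or code the memo refers to


@dataclass
class Project:
    sources: list = field(default_factory=list)
    codes: list = field(default_factory=list)
    segments: list = field(default_factory=list)
    memos: list = field(default_factory=list)


def export_project(project):
    """Serialise a project to a JSON string (stand-in for a real exchange file)."""
    return json.dumps(asdict(project), indent=2)


def import_project(payload):
    """Rebuild a project from the JSON string produced by export_project."""
    raw = json.loads(payload)
    return Project(
        sources=[Source(**s) for s in raw["sources"]],
        codes=[Code(**c) for c in raw["codes"]],
        segments=[Segment(**s) for s in raw["segments"]],
        memos=[Memo(**m) for m in raw["memos"]],
    )


# Invented example data, not taken from any real study.
project = Project(
    sources=[Source("s1", "The bank lowered its rates during the crisis.")],
    codes=[Code("c1", "monetary policy")],
    segments=[Segment("g1", "s1", 4, 26, ["c1"])],
    memos=[Memo("m1", "Check whether this refers to the central bank.", "g1")],
)

# "Package A" exports, "package B" imports: nothing should be lost on the way.
restored = import_project(export_project(project))
```

The point of the round trip is that a second package can rebuild the same sources, coded segments and memos from the exported file alone, which is what makes switching packages, sharing with colleagues, and archiving possible.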

(Image source: REFI website)

The launching event

The project exchange format was launched during a workshop event on March 20-21, 2019 in Rotterdam, where besides the REFI group members, other participants from the archival community and infrastructure projects were invited to present and discuss the implications of these exchange formats.

Presenters included:

  • Ron Dekker, Director of CESSDA, the Consortium of European Social Science Data Archives, who pointed to the limitation of some European projects which end up with tools that cannot be sustained in the long term. He argued in favor of an integrated approach to research data infrastructures which provides a “minimum viable ecosystem” for federating existing initiatives and structures within a single, consolidated and seamless platform that would facilitate data provision and exchange between the four major stakeholders: member states, service providers, data producers, and data users.
  • Sebastian Karcher, from the Qualitative Data Repository (QDR) at Syracuse University, introduced us to the repository, which curates, stores, preserves, publishes, and enables the download of digital data generated through qualitative and multi-method research in the social sciences. Sebastian presented the requirements and challenges in providing high quality data services to researchers, which involve not only curation, but also good documentation, assistance, and training.
  • Louise Corti, from the UK Data Archive, founded at the University of Essex in 1967, introduced the collections, users, and main processes of the archive. She highlighted the importance of the QDA exchange standard, since now QDA packages could offer a “deposit” or “archive” button to their users.
  • Rico Simke, a software engineer from the Center for Digital Systems (CeDiS) of the library of Free University Berlin, described the rich qualitative collections that they host, among others, the “Visual history archive”, which contains 52,000 interviews with survivors and witnesses of the Holocaust, and the “Forced labor” collection, which contains 583 interviews with survivors of Nazi forced labor. Rico explained the curation processes to facilitate fine-grained access to these collections, and we all discussed the tension between software for editing and publishing these collections, versus the software to perform qualitative analyses with those collections.
  • René van Horik, from DANS, the Dutch institute for permanent access to digital research resources, guided us through the existing certifications for data repositories. He highlighted the importance of the QDA exchange standard, since it facilitates the creation of data management plans for researchers.
  • Steve McEachern, from the Australian Data Archive and the ANU Center for Social Research and Methods, talked about Dataverse and the future directions in processing qualitative data. The archive collects and preserves Australian social science data, including 5000 datasets and 1500 studies (with a small set of qualitative research datasets alongside e.g. election studies, public opinion polls, censuses, and administrative data). He also discussed the difficulty of separating what is data from what is analysis, and their efforts to come up with a process model of qualitative research.
  • Julian Hocker, a Ph.D. student in information science at the Leibniz-Institute for research and information in education (DIPF) in Germany, presented his research on a metadata model for qualitative research, which is intended to encourage researchers to share qualitative data, mostly their coding schemes.

Discussion and next steps

At the launching event, the implications of the exchange formats were discussed, focusing at this stage mostly on what the format needs in order to be compatible with the requirements for data deposit at repositories. The participants actively listed the elements required for the standard to be more suitable to this aim. A second version of the exchange format, as well as dissemination activities among the involved communities and the users of the QDAS packages, were listed as the main actions for the REFI group to take in the near future.

In the proposal for CLARIAH-PLUS (p. 8) it is argued that: “The CLARIAH infrastructure will increase our empirical base, options for analysing […] data, and the efficiency of research by orders of magnitude (data-intensive science).”
Ok, but is it true?

Background

Geert Wilders, leader of the Dutch populist party PVV, known for standing up for ordinary people, recently published a tweet (see below) in which he used the completely unknown word 'difficulteren' (doing difficult). Remarkable, because his party is known for its straightforward use of language that even 'ordinary' people can understand.

The reason for the blogs: the use of the particular word 'difficulteren' by populist party leader Geert Wilders.

The translation of this tweet in English is: The President of Parliament Arib seemed okay yesterday when I spoke to her about awarding Muhammad cartoon prizes in Dutch Parliament during “party day”. Now she is going to difficulteren (doing difficult). Suddenly everything must be done via commission, praesidium, etc.

Linguist Marc van Oostendorp, professor of Dutch Language and Literature at Radboud University in Nijmegen and a passionate blogger, wrote a nice blog about this tweet and formulated a conjecture about the use of this word. Marten van der Meulen, PhD student and writer, responded to this blog by conducting corpus searches in data that have been made accessible in the CLARIAH infrastructure in order to test Marc’s conjecture. Marten tried to find out when this unknown word ‘difficulteren’ was used for the first time, how often it has been used at all in recent years, and in what contexts it mainly occurred.

The research

‘increase our empirical base'

Marten searched in six corpora: Staten Generaal Digitaal, Corpus Gesproken Nederlands, Corpus Hedendaags Nederlands, Brieven als Buit Corpus, SoNaR, and the corpora of Nederlab (where the word mainly occurs in Early Dutch Books Online). A prominent feature of CLARIAH is that it allows every humanities scholar to search these resources: you don't have to be a corpus linguist, you don't have to be able to code, you don't have to download corpora or software. CLARIAH offers web applications with user-friendly interfaces that make searching in those corpora easy. See below for links.

'increase options for analysing … data'

These resources make it possible to search by lemma rather than by word form, which makes the search and the analysis of the search results a lot easier and yields more relevant data. Moreover, many of the sources contain metadata such as genre, time and place, so that it can also be quickly determined where, when and in which genres this word occurs frequently or less frequently.
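
The difference between searching by word form and searching by lemma can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the token list is toy data invented for this example (only the 1687 newspaper attestation echoes a real finding in this post), the function names are hypothetical, and the CLARIAH web applications do not require any coding at all.

```python
from collections import Counter

# Toy data invented for this example. In a real corpus, the lemma and the
# metadata are added beforehand by dedicated tools and by the corpus builders.
TOKENS = [
    {"word": "difficulteren", "lemma": "difficulteren", "year": 1687, "genre": "newspaper"},
    {"word": "difficulteerde", "lemma": "difficulteren", "year": 1702, "genre": "letter"},
    {"word": "gedifficulteerd", "lemma": "difficulteren", "year": 1702, "genre": "newspaper"},
    {"word": "moeilijk", "lemma": "moeilijk", "year": 2018, "genre": "tweet"},
]


def search_by_word(tokens, word):
    """Return only the tokens whose surface form matches exactly."""
    return [t for t in tokens if t["word"] == word]


def search_by_lemma(tokens, lemma):
    """Return all tokens that are inflected forms of the given lemma."""
    return [t for t in tokens if t["lemma"] == lemma]


def count_by(hits, key):
    """Count hits per value of a metadata field such as 'genre' or 'year'."""
    return Counter(t[key] for t in hits)


word_hits = search_by_word(TOKENS, "difficulteren")    # misses inflected forms
lemma_hits = search_by_lemma(TOKENS, "difficulteren")  # finds all three forms

per_genre = count_by(lemma_hits, "genre")
first_year = min(t["year"] for t in lemma_hits)
```

A single lemma query replaces one query per inflected form, and the metadata attached to each hit is what allows counting by genre or finding the earliest attestation without manual work.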

'increase the efficiency of research'

Marten did this research within 1 day, something that was not possible before CLARIAH, except perhaps for a select group of corpus linguists.

Difficulteren: Oprechte Haerlemsche courant (08-11-1687). Found in the archives of the Library of the Netherlands by searching for ‘difficulteren’ in the search-app of the NederLab-project.

Of course, you can also search the internet, via Google or Twitter. This complements the search in specific corpora, especially since the empirical basis is then even larger. But then one has to look up all the word forms of this verb separately and the analysis of the results requires more (manual) work, especially because there are hardly any relevant metadata. Marten has also searched with Google, but he has not yet been able to analyse the results in that one day. He also searched the Corpus of the Web (COW) for Dutch, smaller than the whole internet but still quite large (7 billion words), and there were fewer hits, so they could be analysed further.

The search query in question concerns a one-word lemma, and that is a relatively simple task. But the CLARIAH infrastructure also allows much more complex searches, with combinations of words, word pairs with a grammatical dependency relationship, and complete grammatical constructions.

Conclusion

My conclusion is therefore that CLARIAH facilitates and already substantiates the above claim.

Do you want to know more, or take a course to make the best use of these tools? Please feel free to contact CLARIAH via: .

Jan Odijk

 

Links

Corpus Hedendaags Nederlands http://corpushedendaagsnederlands.inl.nl/
OpenSoNaR http://opensonar.inl.nl/
Nederlab http://www.nederlab.nl/
PaQu http://www.let.rug.nl/alfa/paqu/info.html
(searching for word pairs with a grammatical dependency relationship)

GrETEL http://gretel.ccl.kuleuven.be/gretel3/
(searching for grammatical constructions)

General

https://portal.clarin.nl/clariah-tools-fs
(overview of tools and services, still under development)