The Rotterdam Exchange Format Initiative (REFI) launches standard for sharing qualitative data across qualitative data analysis software.

By: Liliana Melgar and Marijn Koolen (CLARIAH project)

The Rotterdam Exchange Format Initiative (REFI) consists of a group of software developers and expert qualitative researchers who have joined efforts to create a standard for the exchange of data between qualitative data analysis software packages, also called CAQDAS or QDAS packages.

QDA software packages are designed to facilitate qualitative data analysis. This type of software has existed for more than thirty years (Silver and Patashnick, 2011). According to SoSciSo, an inventory of software used in social science research, there may be more than thirty packages of this type on the market. This makes it difficult for qualitative researchers to choose a package for their research, and even more difficult to move their data out of, or between, these packages.

Representing CLARIAH, we attended the launch event of the project exchange format produced by the REFI group, and joined the discussions about its implications and next steps.

The REFI initiative and standard

The REFI initiative originated with the aim of solving the difficulties in exchanging data between QDA software. As Fred van Blommestein explains, the main reasons to facilitate exchange were to make it possible for users to switch to other software packages, to exchange data with colleagues, and to leave one package for another (avoiding lock-in) and thus benefit from the best features of each specific package, as well as result verification (comparing results between packages). An additional reason for creating an exchange format, which was extensively discussed during the launch event, is research data archiving.

The idea to facilitate data exchange between QDA packages arose during the KWALON conference in 2010. KWALON is an independent organization of researchers and lecturers at universities, colleges, research agencies and other organizations who deal with the methodology of qualitative social science research. The 2010 so-called “KWALON Experiment” was the first attempt to identify the issues in exchanging qualitative data between these applications: five developers of Qualitative Data Analysis (QDA) software each analysed, with their own software, the same dataset about the 2008-2009 financial crisis, provided by the conference organisers. An article about this experiment was published in FQS (“Forum: Qualitative Social Research”) in 2011.

During the second KWALON conference, which took place in Rotterdam in 2016, Jeanine Evers, an active member of KWALON since 1995, asked the developers of the QDA packages whether they were willing to work on an exchange format. The REFI group was then created and started working right after this conference. Developers from ATLAS.ti, F4 analyse, NVivo, QDA Miner, Quirkos, and Transana have been actively working on the standard, with some participation by developers from Dedoose and MAXQDA. The REFI group is coordinated by Fred van Blommestein, Jeanine Evers, Yves Marcoux, Elias Rizkallah, and Christina Silver (see photo).

The REFI initiative has produced two standards:

  • The first product was a “codebook exchange” format, launched in Montreal in March 2018. This format allows users of QDA packages to export their codebooks and import them into any of the programs that implement the format (more about codebooks and a list of compatible software packages can be found on the REFI website; a simplified, illustrative sketch of such an export is given after this list).
  • The second product, launched on March 18, 2019 in Rotterdam (see photo with the proud group), is the “project exchange” format, which facilitates exporting and importing the main components of a research project carried out with one of the participating software packages. As explained on the REFI website, those components include, among others: the source documents that are analyzed, the segments in those documents that researchers have identified and annotated, the codes and annotations they have assigned to these segments, memos with analytical notes, the links between codes, segments or memos, the cases, the sets/groups of entities, the visual representations of linked entities in the project, and user information.

[Image source: REFI website]
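Both REFI formats are XML-based: the codebook exchange format is a single XML file, and the project exchange format is essentially a zipped container bundling an XML project description with the source files. Purely as an illustration of what a codebook export involves, the sketch below generates a minimal codebook-like XML file in Python; the element and attribute names are simplified placeholders and do not reproduce the exact schema defined in the REFI-QDA specification.

```python
import uuid
import xml.etree.ElementTree as ET

# Simplified, illustrative element and attribute names; the real REFI-QDA
# codebook schema differs in detail.
codebook = ET.Element("CodeBook", origin="MyQDATool 1.0")  # hypothetical exporting tool
codes = ET.SubElement(codebook, "Codes")

for name, colour in [("financial crisis", "#FF0000"), ("trust in institutions", "#0066CC")]:
    ET.SubElement(
        codes,
        "Code",
        guid=str(uuid.uuid4()),  # a GUID per entity lets the importing tool match things up
        name=name,
        isCodable="true",
        color=colour,
    )

ET.ElementTree(codebook).write("codebook.qdc", encoding="utf-8", xml_declaration=True)
```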

The launch event

The project exchange format was launched during a workshop on March 20-21, 2019 in Rotterdam, where, besides the REFI group members, participants from the archival community and from infrastructure projects were invited to present and discuss the implications of these exchange formats.

Presenters included:

  • Ron Dekker, Director of CESSDA, the Consortium of European Social Science Data Archives, pointed to the limitation of some European projects which end up with tools that cannot be sustained in the long term. He argued in favor of an integrated approach to research data infrastructures which provides a “minimum viable ecosystem” for federating existing initiatives and structures within a single, consolidated and seamless platform that would facilitate data provision and exchange between the four major stakeholders: member states, service providers, data producers, and data users.
  • Sebastian Karcher, from the Qualitative Data Repository at Syracuse University, introduced us to the QDR repository, which curates, stores, preserves, publishes, and enables the download of digital data generated through qualitative and multi-method research in the social sciences. Sebastian presented the requirements and challenges in providing high-quality data services to researchers, which involve not only curation, but also good documentation, assistance, and training.
  • Louise Corti, from the UK Data Archive, founded at the University of Essex in 1967, introduced the collections, users, and main processes of the archive. She highlighted the importance of the QDA exchange standard, since QDA packages can now offer a “deposit” or “archive” button to their users.
  • Rico Simke, a software engineer from the Center for Digital Systems (CeDiS) of the library of Free University Berlin, described the rich qualitative collections that they host, among others the “Visual History Archive”, which contains 52,000 interviews with survivors and witnesses of the Holocaust, and the “Forced Labor” collection, which contains 583 interviews with survivors of Nazi forced labor. Rico explained the curation processes that facilitate fine-grained access to these collections, and we all discussed the tension between software for editing and publishing these collections and software for performing qualitative analyses on them.
  • René van Horik, from DANS, the Dutch institute for permanent access to digital research resources, guided us through the existing certifications for data repositories. He highlighted the importance of the QDA exchange standard, since it facilitates the creation of data management plans for researchers.
  • Steve McEachern, from the Australian Data Archive and the ANU Center for Social Research and Methods, which collects and preserves Australian social science data, including 5,000 datasets and 1,500 studies (among them a small set of qualitative research datasets from, e.g., election studies, public opinion polls, censuses, and administrative data), talked about Dataverse and future directions in processing qualitative data. He also discussed the difficulty of separating what is data from what is analysis, and their efforts to come up with a process model of qualitative research.
  • Julian Hocker, a PhD student in information science at the Leibniz Institute for Research and Information in Education (DIPF) in Germany, presented his research on a metadata model for qualitative research, which aims to encourage researchers to share qualitative data, especially their coding schemes.

Discussion and next steps

At the launch event, the implications of the exchange formats were discussed, focusing at this stage mostly on what the format needs in order to be compatible with the requirements for data deposit at repositories. The participants actively listed the elements required to make the standard more suitable for this aim. A second version of the exchange format, as well as dissemination activities among the involved communities and the users of QDAS packages, were listed as the main actions for the REFI group to take in the near future.

In the proposal for CLARIAH-PLUS (p. 8) it is argued that: “The CLARIAH infrastructure will increase our empirical base, options for analysing […] data, and the efficiency of research by orders of magnitude (data-intensive science).”
Ok, but is it true?

Background

Geert Wilders, leader of the Dutch populist party PVV, known for standing up for ordinary people, recently published a tweet (see below) in which he used the completely unknown word 'difficulteren' (making things difficult). Remarkable, because his party is known for its straightforward use of language that even 'ordinary' people can understand.

[Image: the tweet that prompted this blog, in which populist party leader Geert Wilders uses the particular word 'difficulteren'.]

The translation of this tweet into English is: The President of Parliament, Arib, seemed okay yesterday when I spoke to her about awarding Muhammad cartoon prizes in the Dutch Parliament during "party day". Now she is going to 'difficulteren' (make things difficult). Suddenly everything must be done via a commission, the praesidium, etc.

Linguist Marc van Oostendorp, professor of Dutch Language and Literature at Radboud University in Nijmegen and a passionate blogger, wrote a nice blog post about this tweet and formulated a conjecture about the use of this word. Marten van der Meulen, PhD student and writer, responded to this blog post by conducting corpus searches in data that have been made accessible in the CLARIAH infrastructure, in order to test Marc's conjecture. Marten tried to find out when this unknown word 'difficulteren' was used for the first time, how often it has been used in recent years, and in which contexts it mainly occurred.

The research

‘increase our empirical base'

Marten searched six corpora (Staten Generaal Digitaal, the Corpus Gesproken Nederlands, the Corpus Hedendaags Nederlands, the Brieven als Buit Corpus, SoNaR, and the corpora of Nederlab, where the word mainly occurs in Early Dutch Books Online). A prominent feature of CLARIAH is that it allows every humanities scholar to search these resources: you don't have to be a corpus linguist, you don't have to be able to code, and you don't have to download corpora or software. CLARIAH offers web applications with user-friendly interfaces that make searching these corpora easy. See below for links.

'increase options for analysing … data'

These resources make it possible to search by lemma rather than by word form, which makes both the search and the analysis of the results a lot easier and yields a larger number of relevant hits. Moreover, many of the sources contain metadata such as genre, time and place, so it can also be determined quickly where, when and in which genres this word occurs more or less frequently.
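For those who do want to script such a search, several of these corpus applications (OpenSoNaR, for example) are, as far as I know, backed by a BlackLab server that accepts Corpus Query Language (CQL) patterns. The sketch below is a minimal illustration only: the server URL is a placeholder and the exact result fields may differ per corpus.

```python
import requests

# Placeholder URL; the real BlackLab server endpoint of a given corpus differs.
BASE_URL = "https://example.org/blacklab-server/opensonar"

# A CQL pattern that matches every inflected form of the verb via its lemma,
# instead of listing the word forms one by one.
params = {
    "patt": '[lemma="difficulteren"]',
    "outputformat": "json",
    "number": 50,  # maximum number of hits to return
}

response = requests.get(f"{BASE_URL}/hits", params=params, timeout=30)
response.raise_for_status()

for hit in response.json().get("hits", []):
    # Each hit contains the matched tokens plus left and right context.
    print(" ".join(hit["match"]["word"]))
```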

'increase the efficiency of research'

Marten did this research within a single day, something that was not possible before CLARIAH, except perhaps for a select group of corpus linguists.

[Image: 'Difficulteren' in the Oprechte Haerlemsche courant (08-11-1687), found in the archives of the Library of the Netherlands by searching for 'difficulteren' in the search app of the Nederlab project.]

Of course, you can also search the internet, via Google or Twitter. This complements the search in specific corpora, especially since the empirical basis is then even larger. But then one has to look up all the word forms of this verb separately, and the analysis of the results requires more (manual) work, especially because there are hardly any relevant metadata. Marten also searched with Google, but he was not able to analyse those results within that one day. He also searched the Dutch part of the Corpus of the Web (COW), smaller than the whole internet but still quite large (7 billion words); it yielded fewer hits, so these could be analysed further.

The search query in question concerns a one-word lemma, which is a relatively simple task. But the CLARIAH infrastructure also allows much more complex searches, with combinations of words, word pairs in a grammatical dependency relationship, and complete grammatical constructions.
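Applications such as PaQu and GrETEL (see the links below) run these more complex queries over syntactically annotated corpora, where a construction can be expressed as a pattern over parse trees. The fragment below only sketches that idea: it assumes a locally downloaded Alpino/LASSY-style parse file, and the XPath pattern is illustrative rather than a query copied from either application.

```python
from lxml import etree

# Hypothetical path to a locally stored Alpino/LASSY-style dependency parse.
tree = etree.parse("example_sentence.xml")

# Illustrative construction query: a main clause (smain) whose head is a verb
# with the lemma 'difficulteren' and which also has a subject.
construction = (
    '//node[@cat="smain"'
    ' and node[@rel="hd" and @pt="ww" and @lemma="difficulteren"]'
    ' and node[@rel="su"]]'
)

for match in tree.xpath(construction):
    # Alpino-style nodes carry token positions in their begin/end attributes.
    print(match.get("begin"), match.get("end"))
```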

Conclusion

My conclusion is therefore that CLARIAH makes this kind of research possible, and that this example already substantiates the above claim.

Do you want to know more, or take a course to make the best use of these tools? Please feel free to contact CLARIAH via: .

Jan Odijk

 

Links

Corpus Hedendaags Nederlands http://corpushedendaagsnederlands.inl.nl/
OpenSoNaR http://opensonar.inl.nl/
Nederlab http://www.nederlab.nl/
PaQu http://www.let.rug.nl/alfa/paqu/info.html
(searching for word pairs with a grammatical dependency relationship)

GrETEL http://gretel.ccl.kuleuven.be/gretel3/
(searching for grammatical constructions)

General

https://portal.clarin.nl/clariah-tools-fs
(overview of tools and services, still under development)

Last week, the 16th International Semantic Web Conference (ISWC 2017) took place in Vienna, Austria. Around 600 researchers from all over the world came together to exchange knowledge and ideas in 7 tutorials, 18 workshops, and 3 full days of keynotes, conference talks, and a big poster & demo session. Needless to say, I only saw a small part of it, but all the papers and many of the tutorial materials are available through the conference website.

First of all, kudos to the organising committee for putting together a fantastic programme and great overall surroundings. The WU campus (which hosted the workshops, the posters & demos, and the jam session) is really gorgeous, with a marvellous spaceship-like library.

The main conference took place next door at the Messe, where the Wifi worked excellently (quite a feat at a CS conference where most participants carry more than one device). The bar for next year is set high! 

But back to the conference: 

On Sunday, I got to present the SERPENS CLARIAH research pilot during the Second Workshop on Humanities in the Semantic Web (WHISE II). There were about 30 participants in the workshop, and a variety of projects and topics were presented. I particularly liked the presentation by Mattia Egloff on his and Davide Picca's work on DHTK: The Digital Humanities ToolKit. They are working on a Python module that supports the analysis of books, which they are developing and testing in an undergraduate course for humanities students. I really think that by providing (humanities) students with tools to start doing their own analyses, we can get them enthusiastic about programming, as well as thinking about the limitations of such tools, which can lead to better projects in the long run.

In the WHISE workshop, as well as in the main conference, there were several presentations on multimedia datasets for the Semantic Web. The multimedia domain is not new to the Semantic Web, but some of the work (such as "Mixing Music as Linked Data: SPARQL-based MIDI Mashups" by Rick Meerwaldt, Albert Meroño-Peñuela and Stefan Schlobach) doesn't just focus on the metadata but actually encodes the MIDI signal as RDF and then uses it for a mashup.
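To make the idea of encoding a MIDI signal as RDF a bit more concrete, here is a minimal sketch in Python with rdflib of what a single note-on event might look like as triples. The namespace and property names are invented for illustration and do not reproduce the vocabulary used in the actual MIDI Linked Data work.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

# Illustrative namespaces only; the real MIDI Linked Data vocabulary uses its own URIs.
MIDI = Namespace("http://example.org/midi#")
EX = Namespace("http://example.org/resource/")

g = Graph()
g.bind("midi", MIDI)

# A single note-on event from a MIDI track, encoded as RDF triples.
event = EX["track0/event42"]
g.add((event, RDF.type, MIDI.NoteOnEvent))
g.add((event, MIDI.channel, Literal(0, datatype=XSD.integer)))
g.add((event, MIDI.pitch, Literal(60, datatype=XSD.integer)))    # middle C
g.add((event, MIDI.velocity, Literal(96, datatype=XSD.integer)))
g.add((event, MIDI.tick, Literal(480, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```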

Another very interesting resource is IMGpedia, created by Sebastián Ferrada, Benjamin Bustos and Aidan Hogan, which was presented in a regular session (winner of the best student resource paper award) as well as during the poster session (winner of the best poster award). The interesting thing about this resource is that it allows you to query not only on metadata elements, but also on visual characteristics.
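As a rough illustration of what querying on visual characteristics could look like, the sketch below uses SPARQLWrapper to ask a SPARQL endpoint for images that are visually similar to a given image. The endpoint URL, the ontology prefix and the property name are placeholders, not the actual IMGpedia vocabulary.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint and vocabulary; the real IMGpedia endpoint and ontology differ.
sparql = SPARQLWrapper("https://example.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX imo: <http://example.org/imgpedia/ontology#>

SELECT ?other WHERE {
  # Images linked to a given image by a (placeholder) visual-similarity relation.
  <http://example.org/imgpedia/resource/Mona_Lisa.jpg> imo:similar ?other .
}
LIMIT 10
""")

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["other"]["value"])
```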


Metadata and content features are also combined in "The MIDI Linked Data Cloud" by Albert Meroño-Peñuela, Rinke Hoekstra, Victor de Boer, Stefan Schlobach, Berit Janssen, Aldo Gangemi, Alo Allik, Reinier de Valk, Peter Bloem, Bas Stringer and Kevin Page, which would, for example, make studies in ethnomusicology possible. I think such combinations of modalities are super exciting for humanities research, where we work with extremely rich information sources and often need or want to combine sources to answer our research questions.

Enriching and making available cultural heritage data is also a topic that keeps popping up at ISWC. This year there was, for example, "Lessons Learned in Building Linked Data for the American Art Collaborative" by Craig Knoblock, Pedro Szekely, Eleanor Fink, Duane Degler, David Newbury, Robert Sanderson, Kate Blanch, Sara Snyder, Nilay Chheda, Nimesh Jain, Ravi Raju Krishna, Nikhila Begur Sreekanth and Yixiang Yao. This project was a pretty big undertaking in terms of aligning and mapping museum collections. I really like that the first lesson learned was to create reproducible workflows.


This doesn't hold only for the conversion of museum collections, but for all research; still, it's nice to see it mentioned here. Reproducibility is also a motivator in "Reliable Granular References to Changing Linked Data" by Tobias Kuhn, Egon Willighagen, Chris Evelo, Núria Queralt Rosinach, Emilio Centeno and Laura Furlong, which investigates the use of nanopublications to enable fine-grained references to items or subsets within changing data collections when citing previous work.
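For readers unfamiliar with nanopublications: the general model packages a small assertion together with its provenance and publication information as named graphs, tied together by a head graph. The sketch below builds such a structure with rdflib; the nanopub.org and PROV vocabulary terms are real, but all other URIs and the example assertion are placeholders, not data from the paper.

```python
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

NP = Namespace("http://www.nanopub.org/nschema#")      # nanopublication core vocabulary
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/np/")                # placeholder URIs for this sketch

ds = Dataset()
head = ds.graph(EX["head"])
assertion = ds.graph(EX["assertion"])
provenance = ds.graph(EX["provenance"])
pubinfo = ds.graph(EX["pubinfo"])

nanopub = EX["nanopub1"]

# The head graph wires the three content graphs together.
head.add((nanopub, RDF.type, NP.Nanopublication))
head.add((nanopub, NP.hasAssertion, EX["assertion"]))
head.add((nanopub, NP.hasProvenance, EX["provenance"]))
head.add((nanopub, NP.hasPublicationInfo, EX["pubinfo"]))

# The assertion: the (tiny) claim that can now be cited on its own.
assertion.add((EX["dataset-item-42"], EX["observedValue"],
               Literal("3.14", datatype=XSD.decimal)))

# Provenance of the assertion, and publication info about the nanopublication itself.
provenance.add((EX["assertion"], PROV.wasDerivedFrom,
                URIRef("http://example.org/source-dataset")))
pubinfo.add((nanopub, PROV.generatedAtTime,
             Literal("2017-10-27T12:00:00Z", datatype=XSD.dateTime)))

print(ds.serialize(format="trig"))
```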

My favourite keynote at this conference (and they had three excellent ones) was by Jamie Taylor, formerly of Freebase, now at Google. He argued for more commonsense knowledge in our knowledge graphs. I do think that is a great vision, since many of our resources lack such knowledge, which leads to all sorts of weird outcomes in, for instance, named entity linking (you can ask Filip Ilievski for the funniest examples), but it was unclear how to go about it and whether it would be possible at all. The examples he gave in the keynote, toasters and kettles, would work out just fine (kettles heat up water, toasters heat up baked goods), but for complex concepts such as murders (Sherlock Holmes, anyone?) I'm not sure how this would work. Enough food for thought, in any case. See also Pascal Hitzler's take on this keynote.

For other highlights of the conference, check out these other trip reports by Juan Sequeda and Paul Groth.

 

See you in Monterey, California next year? 


 

Submitted by Karolina Badzmierowska on 23 October 2017

Tour de CLARIN

“Tour de CLARIN” is a new CLARIN ERIC initiative that aims to periodically highlight prominent User Involvement (UI) activities of a particular CLARIN national consortium. The highlights include an interview with one or more prominent researchers who are using the national consortium’s infrastructure and can tell us more about their experience with CLARIN in general; one or more use cases that the consortium is particularly proud of; and any relevant user involvement activities carried out. “Tour de CLARIN” helps to increase the visibility of the national consortia, to reveal the richness of the CLARIN landscape, and to display the full range of activities throughout the network. The content is disseminated via the CLARIN Newsflash and blog posts, and linked to on our social media: Twitter and Facebook.

The Netherlands

CLARIAH-NL is a project in the Netherlands that is setting up a distributed research infrastructure providing humanities researchers with access to large collections of digital data and user-friendly processing tools. The Netherlands is a member of both CLARIN ERIC and DARIAH ERIC, so CLARIAH-NL contributes not only to CLARIN but also to DARIAH. CLARIAH-NL covers not only humanities disciplines that work with natural language (the defining characteristic of CLARIN) but also disciplines that work with structured quantitative data. Though CLARIAH aims to cover the humanities as a whole in the long run, it currently focusses on three core disciplines: linguistics, social-economic history, and media studies.

CLARIAH-NL is a partnership that involves around 50 partners from universities, knowledge institutions, cultural heritage organizations and several SAB-companies; the full list can be found here. Currently, the data and applications of CLARIAH-NL are managed and sustained at eight centres in the Netherlands: Huygens ING, the Meertens Institute, DANS, the International Institute for Social History, the Max Planck Institute for Psycholinguistics, the Netherlands Institute for Sound and Vision, the National Library of the Netherlands, and the Institute of Dutch Language. Huygens ING, the Meertens Institute, the Max Planck Institute for Psycholinguistics, and the Institute of Dutch Language are certified CLARIN Type B centres. The consortium is led by an eight-member board; its director and national coordinator for CLARIN ERIC is Jan Odijk.

The research, development and outreach activities of CLARIAH-NL are distributed among five work packages: Dissemination and Education (WP1) and Technology (WP2) deal with user involvement and with the technical design and construction of the infrastructure, respectively, whereas the remaining three work packages focus on three selected research areas: Linguistics (WP3), Social and Economic History (WP4) and Media Studies (WP5).

 

The full blog can be read here: https://www.clarin.eu/blog/tour-de-clarin-netherlands

 


17 October 2017, Christian Olesen

In early September, Liliana Melgar and I (Christian Olesen) received an invitation from Barbara Flückiger, Professor in Film Studies at the University of Zürich, to participate in the colloquium “Visualization Strategies for the Digital Humanities”. The aim of the day was to bring together experts to discuss film data visualization opportunities in relation to Professor Flückiger’s current research projects on the history of film colors. Flückiger currently leads two large-scale projects on this topic: the ERC Advanced Grant FilmColors (2015-2020) and the Filmfarben project funded by the Swiss National Science Foundation (2016-2020). A presentation of the projects’ team members can be found here.

As a scholar, Barbara Flückiger has in-depth expertise on the interrelation between film technology, aesthetics and culture; her research has covered especially film sound, special effects, film digitization and film colors. In recent years, her research has increasingly focussed on film colors, especially since the launch of the online database of film colors, the Timeline of Historical Film Colors, in 2012 after a successful crowdfunding campaign. The Timeline of Historical Film Colors has since grown to become one of the leading authoritative resources on the history and aesthetics of film colors – it is presented as “a comprehensive resource for the investigation of film color technology and aesthetics, analysis and restoration”. It is now consolidating this position as it is being followed up by the two large-scale research projects mentioned above, which merge perspectives from film digitization, restoration, and aesthetic and cultural history.

These projects are entering a phase in which the involved researchers are beginning to conceive ways of visualizing the data they have created so far and need to consider the potential value which data visualization may have for historical research on film color aesthetics, technology and reception.

The full report, with many impressions from the visit, can be read here.