(by Marieke van Erp)
On 19 and 20 June, the First International Conference on Language, Data and Knowledge (LDK2017) took place in Galway, Ireland. The conference wasn’t too big (~80 participants) and featured a broad and interesting single track programme. It had been a while since I had attended a single track conference, and I had kind of forgotten how much I liked it, so I hope the organisers keep that for the next edition (Leipzig 2019).
CLARIAH collaborator Antal van den Bosch kicked off the conference with the first keynote titled “Processing Text as Socio-Economic and Cultural Data” in which he featured several social sciences and digital humanities text analytics use cases. I really liked his call for a holistic approach to language (which I interpreted as lying at the heart of the conference theme), namely combining whatever information and approaches are available to answer the deeper questions.
After his talk, an audience member mentioned that he found that approaches presented at digital humanities conferences are often still fairly coarse-grained, which may be a result of researchers expecting 100% accuracy. This is something that I have noticed before and which was also a big theme in the second keynote of the conference, by Zoltán Slávik (IBM), who argued that technology developers have a huge responsibility to manage expectations. I think Antal’s answer to the audience question reflected this, and he included a remark on keeping the human in the loop, which is also the direction IBM seems to be taking with porting Watson to the medical domain.
Of particular interest
Most of the talks were really interesting; for the full programme and proceedings, see the conference website. Here are a few papers that I think are particularly interesting to the CLARIAH community.
On the creation of resources: There was an interesting paper on the creation of an ontology for linguistic terminology (OnLiT: An Ontology for Linguistic Terminology, Bettina Klimek, John P. McCrae, Christian Chiarcos and Sebastian Hellmann) which aims to provide an interoperable model and dataset for linguistic terminology. One of the things we have run into in WP3 is that there are different glossaries and similar resources around for describing different linguistic concepts; perhaps OnLiT is an interesting option to look at to start integrating them. One issue that may arise came up in Maria Keet’s presentation on Representing and aligning similar relations: parts and wholes in isiZulu vs English: certain concepts present in one language may not exist in another, or only partly. I am not sure yet whether OnLiT can represent all of this, but it is still a work in progress.
Another issue in resource creation is the fact that the resource will always be a snapshot of the language at the time the resource was created. One of our most commonly used resources in language technology is WordNet, but it hasn’t been updated for 10 years. Back then, “to tweet” was a verb that applied to birds; now it refers to posting on a microblog. John P. McCrae and Ian Wood presented a paper they wrote together with Amanda Hicks in which they aimed to extend WordNet with neologisms by gathering terms from Twitter and Reddit and filtering them through various sieves.
During the very nice poster session, some interesting digital humanities use cases were presented. The first two are by the group of Hyvönen in Finland: Named Entity Linking in a Complex Domain: Case Second World War History by Erkki Heino, Minna Tamper, Eetu Mäkelä, Petri Leskinen, Esko Ikkala, Jouni Tuominen, Mikko Koho and Eero Hyvönen and Reassembling and Enriching the Life Stories in Printed Biographical Registers: High School Alumni on the Semantic Web by Eero Hyvönen, Petri Leskinen, Erkki Heino, Jouni Tuominen and Laura Sirola
What I liked about these is that they deal with real, dirty data and provide interesting examples of the things we can do with data from, for example, NIOD and biographical resources.
Another highlight of the poster session for me was Exploring the Role of Gender in 19th Century Fiction Through the Lens of Word Embeddings by Siobhán Grayson, Maria Mulvany, Karen Wade, Gerardine Meaney and Derek Greene. One reason for me to be interested in this is that I’m supervising an MSc thesis that deals with automatic analysis of novels; the other is that I liked how they visualised their results, which I think is very important, especially when working in interdisciplinary settings.
- Gaelic is super interesting, but also super complex (as Graham Isaac’s keynote made clear)
- The crowd was still quite technical; more humanities researchers in attendance may spark even more interesting cross-disciplinary conversations
- The weather in Ireland is really not that bad (but don’t forget your waterproof jacket)
- Kathleen McKeown (Columbia University, the third keynote speaker) is definitely someone whose work is worth looking into, as I already mentioned in this blog post.
- Why aren’t more conferences doing BBQs?
On Tuesday 13 June 2017, the second CLARIAH Linked Data workshop took place. After the first workshop in September, which was very much an introduction to Linked Data for the CLARIAH community, we wanted to organise a more hands-on workshop where researchers, curators and developers could get their hands dirty.
The main goal of the workshop was to introduce relevant tools to novice as well as more advanced users. After a short plenary introduction, we therefore split up the group. For the novice users, the focus was on tools that come with a graphical user interface, like OpenRefine and Gephi, whereas we demonstrated API-based tools to the advanced users, such as the CLARIAH-incubated COW, grlc, Cultuurlink and ANANSI. Our setup, namely to have the participants convert their own dataset to Linked Data and then query and visualise it, was somewhat ambitious as we had not taken into account all data formats or encodings. Overall, participants were able to get started with some data, and ask questions specific to their use cases.
It is impossible to fully clean and convert and analyse a dataset in a single day, so the CLARIAH team will keep investigating ways to support researchers with their Linked Data needs. For now, you can check out the slides and tutorial materials from the workshop and keep an eye out on this website for future CLARIAH LOD events.
by Stef Scagliola.
This blog post previously appeared on the website of C2DH
The challenge of this workshop on ‘transcription and technology’, which took place from 10 to 12 May in Arezzo (IT), consisted in turning recorded human speech into a textual representation that is as close as possible to what has been uttered.
Say Arezzo and any art historian’s thoughts will divert to Piero della Francesca’s innovation in the visual representation of reality. It was this Renaissance painter of the 15th century who in his frescos dared to depict religious figures as real humans. By creating the illusion of depth and light and by paying detailed attention to the anatomy of their bodies, Christ, the Madonna and other holy personalities were no longer spiritual creatures floating in the air, but looked like real people.
Goal of the workshop
The efforts of a workshop on ‘transcription and technology’, which we attended from 10 to 12 May in this beautiful Tuscan town, were also geared towards an accurate representation of reality. The main goal of the workshop was to come forward with a proposal for a Transcription Chain: a set of web-based services turning recorded human speech into a textual representation that is as close as possible to what has been uttered.
Speech recognition technology can be relevant for humanities research in two ways: it can open up huge amounts of spoken data in archives of which the content is mostly unknown, and it can speed up the lengthy process of manual transcription for scholars who want to analyze their interviews in depth.
The importance of studying speech is evident when taking into consideration the role of recorded voice and moving image for the human expression of the 20th century. Digital tools have already conquered the world of text, magnifying the scale and speed at which phenomena can be observed, but too little attention is given to how we use spoken language and memory to shape our lived experiences into a set of meaningful and coherent stories.
With this agenda in mind, a mix of Italian, Dutch, British, Czech and German oral historians, linguists, data specialists and speech technologists got together to assess which Digitization, Speech Retrieval, Alignment and Transcription tools are suitable for creating a semi-automated workflow that can turn analogue recordings into readable transcripts. The workshop was supported by CLARIN ERIC, the European infrastructure offering digital data and tools for Digital Humanities scholarly research. It was created to serve a broad range of scholars, but until recently it was foremost a much cherished treasure trove for linguists.
The increased interest in cross-disciplinary approaches to data is the appropriate context for making efforts to recruit more enthusiastic users from the humanities and social science fields. This objective raises the question of which requirements are relevant to which type of scholar who works with speech data. It also asks scholars to step out of their ‘comfort zone’ and consider other approaches.
Contributors were asked to present an overview of conventions and practices that should be considered to create a well-suited workflow: What are the metadata schemes used in speech data? What are the guidelines for transcription? What are existing digital infrastructures capable of providing? And what does proprietary commercial software already have in store? After the technical partners presented a parade of tools, the real fun part started: testing the various tools with 5-minute clips of audio.
The speech recognition tools are of course language-specific. For English, participants could try out the web service offered by Sheffield University. The Dutch could try out the ASR service of the Radboud University in Nijmegen, and the Italians could practice with the stand-alone alignment software ‘Segmenta’ created by Piero Cosi at the CNR in Padua.
As expected, the speech retrieval software performed poorly with clips containing language with strong regional accents, such as a corpus with Tuscan dialect, or an interview with a narrator speaking with a Flemish/Moroccan accent. The good news, however, is that it performed excellently with clips that contained regular speech, and that this applied to all three languages.
For the less technically savvy scholars, it was a surprise to hear how confident speech technologists were about the chances of success when trying to customize speech recognition software to work on non-standard language varieties and lesser-researched languages. The most important requirement seems to be having enough training material in the form of a lexicon, a language model and an acoustic model that can be fed into the software. Success and low word error rates (WER) appear to be a question of scale, training and perseverance. This might raise new hopes for mobilizing awareness and fostering research on small language groups such as Luxembourgish.
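For readers who have not come across WER before: it is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the recogniser's output into the reference transcript, divided by the number of words in the reference. The sketch below is only an illustration of that definition, not the scoring used by any of the tools tried at the workshop. The function name `word_error_rate` and the example sentences are made up for this post, and the tokenisation (splitting on whitespace, no case or punctuation normalisation) is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are whatever str.split() yields; no normalisation
    of case or punctuation is applied.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row[j] = min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            )
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Note that WER can exceed 1.0 when the recogniser inserts many extra words, which is one reason a single WER figure says little without knowing how the transcripts were normalised.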
The next step in the workflow is the transcription. Several tools were presented, such as OCTRA-2D from München and SubtitleEdit, created by Danish developers. An unexpected contribution from the world of journalism was OTranscribe, which seemed to be the easiest to handle. The challenge is of course to customize these tools in such a way that they can effectively import the outputs of the speech recognition, so that the correction can begin without having to do any additional clean-up or restructuring.
What was striking when observing the various conventions is that only sound-based speech studies use time codes. When it comes to studying the interpretation of what is uttered, meaning that you need whole utterances to grasp the meaning and context, there is no tradition of documenting time codes in the metadata. This means that a lot of ‘conversion’ in the persuasive sense of the word has to be done to have humanities and social science scholars make optimal use of digital tools.
The last step in the workflow is the alignment, connecting the audio signal to the transcription. This facilitates browsing and searching through an entire corpus of recordings, and can easily be done with ASR output that is not completely correct. For this part of the chain, the Bavarian Archive for Speech Signals has provided WebMAUS, an open source webservice for phoneticians. The demonstration by Christoph Draxler showed that this resource has many more features that could be utilized than was initially known by the organizers.
The Bavarian Archive for Speech Signals already conducts online experiments and web-based audio transcriptions via crowdsourcing. Due to the lack of familiarity with other disciplines, these functionalities had not been offered to other target groups. These ‘surprises’ were recurrent during the workshop, and showed that mixing disciplines opens up the bubble of your own research network. Programs that for some scholars represent mainstream technology had the impact of real revelations to others.
This was certainly the case with a number of Italian PhDs. Two moments were exemplary for the diversity in the use of criteria for quality and terminology. The first was when speech technologist Piero Cosi informed linguist Silvia Calamai that her ‘best piece of recording’ had performed the ‘worst’ of all clips. The other was when it became clear that in Thomas Hain's interface for speech recognition, the field ‘metadata’ did not refer to the convention of completing a template with the properties of a document. In this web resource, its function was to encourage uploading textual documents that cover the topic of the sound recording, in order to improve the recognition performance. It was also discovered that linguists and social scientists can mean very different things when they talk about ‘annotation’.
A last component of the chain was also considered: creating a community to crowdsource the transcription of an interview collection. The sensational success of crowdsourcing personal written documents promises good results as long as the workflow is arranged properly. The platform Crowdflower could provide such a structure. With such projects, there are advantages and disadvantages to weigh when we compare dedicated platforms such as Crowdflower or Zooniverse with using our own platforms for crowdsourced projects. Dedicated platforms provide lots of functionality for building and maintaining a community of volunteers, but allow researchers only limited control over the data and software hosted on the platform. Using our own websites to carry out such projects would require lots of improvements in the user interfaces, and lots of effort to reach people and keep them involved.
Of course there were also undercurrents of scepticism, which can ‘spoil the party’, but they deserve a prominent role in the assessment of the potential. These refer to the limits of the efficiency of customizing tools that are created by scholars with no commercial interest and who will eventually retire or change jobs.
Another objection was to the top-down approach, the idea that there is a chain and that by customizing existing tools that were created for other purposes, you can cater for a variety of scholars. An alternative would be choosing one discipline, observing all practices attentively, and designing the best tool or tools to fit these practices.
These objections warn against setting no limits to the customization and against presenting the chain as a service to all scholars that will be maintained eternally. But academics are not eternal; they are mortal creatures who are supposed to produce new knowledge, not services. On the other hand, these types of arguments can also paralyze creativity and enthusiasm, and the will to collaborate for a common goal. The ideal setting for creating optimal services in a non-commercial environment will probably remain a dream. So to push the further development of open source resources we are bound to reach compromises and to take small steps.
The setting in Arezzo was perfect. A mix of nationalities, generations and disciplines engaged in opening up stories about ordinary people, and last but not least, a warm and thoughtful reception by our hosts Silvia Calamai, Francesca Biliotti, Simona Matteini and Caterina Pesce.
For people heading to Arezzo this summer: try La Lancia d’Oro and l’Agania. Readers who want to know more about Oral History and Technology can take a look at the Oral History website curated by Arjan van Hessen and Henk van de Heuvel.
If you are interested in the progress of our effort to create a transcription chain, or are willing to share your experiences with trying out the tools mentioned in this blog, this is the place to be.
On February 6 and 7 CLARIAH WP3 organized a workshop to discuss the application of Linked Data for linguistic research. The workshop, which went under the appropriate acronym of LD4LR, invited presentations from a number of foreign experts as well as a number of representatives from CLARIN centers that had gained some experience using Linked Data in their projects. The workshop concentrated on the perspective of the linguistic researcher who is increasingly confronted with all kinds of information about Linked Data and who needs to know what Linked Data can bring to her research. A number of prominent Dutch linguists were invited to present their current research topics, after which our experts could make suggestions on how to apply Linked Data paradigms to the researcher’s benefit. The invited experts, who besides being Linked Data experts are also active linguists, presented their efforts with Linked Data in the fields of Lexica, Phonetics and Treebanks.
Overall there was sufficient time for good discussions, where the experts tried to avoid overly specific terminology and concentrate on user needs. In the round-up summaries from day one and two, Sjef Barbiers and Jan Odijk concluded that although interesting things happen with Linked Data in linguistics, it does not seem immediately usable for end-user researchers unless they themselves are as familiar with Linked Data as the invited experts are. To let a broader group of linguists benefit from the potential of Linked Data, we need a better bridge between technologists and researchers. Dedicated pilots in WP3 should stimulate investigation of the usefulness of applying Linked Data to different types of linguistic research, especially lexical resources (DUELME). From a data provisioning perspective, the benefits of Linked Data for interoperability purposes are clear.
|13:00 – 13:15||Welcome (Why this Workshop)||Daan Broeder|
|13:15 – 14:00||Why should I use LD for my research?, LD in Comparative Syntax||Nicoline van der Sijs|
|14:00 – 15:30||Broad overview: What kind of (L)L(O)D is available? What linguistic research has been done using it?|
|14:00 – 14:30||John McCrae|
|14:30 – 15:00||Steven Moran|
|15:00 – 15:30||Giuseppe Celano|
|15:30 – 16:00||break|
|16:00 – 17:30||Experiences of CLARIAH/CLARIN centers|
|16:00 – 16:30||Antske Fokkens, Willem van Hage|
|16:30 – 17:00||Matej Durco|
|17:00 – 17:30||Thomas Eckart|
|17:30 – 17:15||Wrap-up day 1||Sjef Barbiers|
|09:00 – 09:15||Outlook Day 2||Menzo Windhouwer|
|09:15 – 11:00||Linguistic research case studies|
|09:15 – 09:45||Introduction||Marjo van Koppen|
|09:45 – 11:00||How would you use LD for this research? Expert responses||John McCrae|
|11:00 – 11:30||break|
|11:30 – 12:00||Dieter van Uytvanck|
|12:00 – 13:00||lunch|
|13:00 – 13:30||Linked Data opportunities & limitations||Daan Broeder|
|13:30 – 14:00||Conclusions||Jan Odijk|
On 27 October 2016, the University of Amsterdam opened its doors to The Humanities And Technology Camp (THATCamp).
In recent years the THATCamp formula has crossed the Atlantic and spread over Europe. At THATCamp Amsterdam we came to fully understand the reason for THATCamp’s success: THATCamp is a playful, informal and fun event where programmers and humanities scholars are able to meet, learn about each other's work, toy around with different types of software, and make plans for collaborative projects in the future.
At THATCamp Amsterdam topics ranged widely: from the web’s unboundedness to the use of crowdsourcing in research, from the spread of cinemas in the Netherlands to the role of machines on the work floor. Linked Open Data practitioners exchanged working techniques, while Art Historians explored best computational research practices and an Amsterdam historical GIS hotspot took shape. In between there was coffee, salad and "broodjes", and by the end of the day new plans had emerged for collaborative work on Amsterdam’s Creative Industries, from various perspectives and on multiple scales.
For anyone organizing a THATCamp, the catch in the formula is that THATCamp does not really want to be organized top-down. As can be read on the official website, THATCamp is an "unconference": it is participatory ("there are no spectators at a THATCamp"), informal (there are "no lengthy proposals, papers, presentations"), productive (the focus is on "collegial work or free-form discussion"), flat-structured ("non-hierarchical"), and crucially bottom-up: at THATCamp, the program is created by all participants together, "on the spot" as part of a collective voting session.
For the record: we need not have worried. THATCamp recommends avoiding web-based technology to facilitate the voting, arguing that “the in-person method works well and is fun.” In Amsterdam, this participatory, personal approach of the first session resonated well with the general enthusiastic and constructive attitude of the THATCamp participants. As it turns out, a small collection of post-its, clothespegs, a few sheets of paper and a large dose of enthusiasm and curiosity may just be the perfect toolkit to start a day of collectively exploring the intersections of humanities scholarship and technology.
THATCamp Amsterdam was hosted by the research project Creative Amsterdam: an E-Humanities Perspective (CREATE), at the Amsterdam Centre for Cultural Heritage and Identity. An impression of THATCamp Amsterdam, including a list of session proposals, may be found on the THATCamp Amsterdam webpage and the CREATE blog.
For more information on other events and research projects carried out within the CREATE Program, please visit the CREATE page.
- 29-06-2017 LDK Trip report
- 20-06-2017 Report CLARIAH Linked Data Workshop 2
- 25-05-2017 Catching Speech in Arezzo: A Clarin workshop for developing a transcription-chain for Oral History
- 23-02-2017 LD4LR: Linked Data for Linguistic Research
- 17-11-2016 THATCamp Amsterdam 2016: happy afterthoughts
- 16-11-2016 Team CLARIAH wins Audience Award at Hackalod 2016
- 16-11-2016 ISWC 2016
- 07-10-2016 The Role of Narratives in DIVE
- 16-09-2016 CLARIAH Linked Data Workshop
- 24-07-2016 Audiovisual Data And Digital Scholarship: Towards Multimodal Literacy