by Stef Scagliola.
This blog already appeared on the website of C2DH
The challenge of this workshop on ‘transcription and technology’, which took place from 10 to 12 May in Arezzo (IT), consisted in turning recorded human speech into a textual representation that is as close as possible to what has been uttered.
Say Arezzo and any art historian’s thoughts will turn to Piero della Francesca’s innovation in the visual representation of reality. It was this Renaissance painter of the 15th century who dared to depict religious figures in his frescos as real humans. By creating the illusion of depth and light and by paying detailed attention to the anatomy of their bodies, he made Christ, the Madonna and other holy figures look like real people rather than spiritual creatures floating in the air.
Goal of the workshop
The efforts of a workshop on ‘transcription and technology’, which we attended from 10 to 12 May in this beautiful Tuscan town, were also geared towards an accurate representation of reality. The main goal of the workshop was to come forward with a proposal for a Transcription Chain: a set of web-based services that turn recorded human speech into a textual representation that is as close as possible to what has been uttered.
Speech recognition technology can be relevant for humanities research in two ways: it can open up huge amounts of spoken data in archives of which the content is mostly unknown, and it can speed up the lengthy process of manual transcription for scholars who want to analyze their interviews in depth.
The importance of studying speech is evident when taking into consideration the role of recorded voice and moving image for the human expression of the 20th century. Digital tools have already conquered the world of text, magnifying the scale and speed at which phenomena can be observed, but too little attention is given to how we use spoken language and memory to shape our lived experiences into a set of meaningful and coherent stories.
With this agenda in mind, a mix of Italian, Dutch, British, Czech and German oral historians, linguists, data specialists and speech technologists got together to assess which Digitization, Speech Retrieval, Alignment and Transcription tools are suitable for creating a semi-automated workflow that can turn analogue recordings into readable transcripts. The workshop was supported by CLARIN ERIC, the European infrastructure offering digital data and tools for Digital Humanities scholarly research. It was created to serve a broad range of scholars, but until recently it was foremost a much cherished treasure trove for linguists.
The increased interest for cross-disciplinary approaches to data is the appropriate context for making efforts to recruit more enthusiastic users from the humanities and social science fields. This objective begs the question of which requirements are relevant to which type of scholar who works with speech data. It also asks scholars to step out of their 'comfort zone' and consider other approaches.
Contributors were asked to present an overview of conventions and practices that should be considered to create a well-suited workflow: What are the metadata schemes used in speech data? What are the guidelines for transcription? What are existing digital infrastructures capable of providing? And what does proprietary commercial software already have in store? After the technical partners presented a parade of tools, the real fun part started: testing the various tools with 5-minute clips of audio.
The speech recognition tools are of course language-specific. For English they could try out the web service offered by Sheffield University. The Dutch could try out the ASR-service of the Radboud University in Nijmegen, and the Italians could practice with the stand-alone alignment software ‘Segmenta’ created by Piero Cosi at the CNR in Padua.
As expected, the speech retrieval software performed poorly with clips containing language with strong regional accents, such as a corpus with Tuscan dialect, or an interview with a narrator speaking with a Flemish/Moroccan accent. The good news, however, is that it performed excellently with clips that contained regular speech, and that this applied to all three languages.
For the less technically savvy scholars, it was a surprise to hear how confident speech technologists were about the chances of success when trying to customize speech recognition software to work on non-standard language varieties and lesser-researched languages. The most important requirement seems to be having enough training material in the form of a lexicon, a language model and an acoustic model that can be fed into the software. Success and low word error rates (WER) appear to be a question of scale, training and perseverance. This might raise new hopes for mobilizing awareness and fostering research on small language groups such as Luxembourgish.
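For readers unfamiliar with the measure, WER is commonly defined as the number of word substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition and not the scoring method used by any of the tools at the workshop; the function name `word_error_rate` and the simple lowercase, whitespace-based tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption for this sketch: words are compared case-insensitively
    after splitting on whitespace; real evaluations normalize more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions are needed to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word inserted in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

With the reference "the cat sat on the mat" and the recognizer output "the cat sit on mat", there is one substitution and one missing word, so the function returns 2/6. Note that WER can exceed 1.0 when the output contains many inserted words.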
The next step in the workflow is the transcription. Several tools were presented, such as OCTRA-2D from München, SubtitleEdit, created by Danish developers, and an unexpected contribution from the world of journalism was OTranscribe, which seemed to be the easiest to handle. The challenge is of course to customize these tools in a way that they can effectively import the outputs of the speech recognition, so that the correction can begin, without having to do any additional clean up or re-structuring.
What was striking when observing the various conventions is that only sound-based speech studies use time codes. When it comes to studying the interpretation of what is uttered, meaning that you need whole utterances to grasp the meaning and context, there is no tradition of documenting time codes in the metadata. This means that a lot of ‘conversion’, in the persuasive sense of the word, has to be done to have humanities and social science scholars make optimal use of digital tools.
The last step in the workflow is the alignment, connecting the audio signal to the transcription. This facilitates browsing and searching through an entire corpus of recordings, and can easily be done with ASR output that is not completely correct. For this part of the chain, the Bavarian Archive for Speech Signals has provided WebMAUS, an open source webservice for phoneticians. The demonstration by Christoph Draxler showed that this resource has many more features that could be utilized than was initially known by the organizers.
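To illustrate why alignment helps with browsing and searching, the sketch below looks up a search term in a list of time-aligned words and returns the offsets at which a player could start playback. It does not reproduce the WebMAUS output format or API: the `AlignedWord` structure, the function names and the sample values are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedWord:
    """One transcript word with its position in the audio (hypothetical format)."""
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


def find_term(words: List[AlignedWord], term: str) -> List[float]:
    """Return the start time of every occurrence of `term`, ignoring case
    and surrounding punctuation."""
    needle = term.lower()
    return [w.start for w in words
            if w.text.lower().strip(".,;:!?") == needle]


def format_timecode(seconds: float) -> str:
    """Render a time offset as HH:MM:SS, dropping fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Invented sample data, not taken from any real recording.
sample = [
    AlignedWord("We", 3724.9, 3725.1),
    AlignedWord("met", 3725.1, 3725.4),
    AlignedWord("in", 3725.4, 3725.5),
    AlignedWord("Arezzo.", 3725.5, 3726.2),
]
```

Calling `find_term(sample, "arezzo")` gives `[3725.5]`, which `format_timecode` renders as `01:02:05`. Because the lookup only needs a word and its approximate position, it still works reasonably when some of the surrounding ASR output is wrong.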
The Bavarian Archive for Speech Signals already conducts online-experiments and web-based audio transcriptions via crowd-sourcing. Due to the lack of familiarity with other disciplines these functionalities had not been offered to other target groups. These ‘surprises’ were recurrent during the workshop, and showed that mixing disciplines opens up the bubble of your own research network. Programs that for some scholars represent mainstream technology had the impact of real revelations to others.
This was certainly the case with a number of Italian PhD students. Two moments were exemplary for the diversity in the use of criteria for quality and terminology. The first was when speech technologist Piero Cosi informed linguist Silvia Calamai that her ‘best piece of recording’ had performed the ‘worst’ of all clips. The other was when it became clear that in Thomas Hain's interface for speech recognition, the field ‘metadata’ did not refer to the convention of completing a template with the properties of a document. In this web resource, its function was to encourage uploading textual documents that cover the topic of the sound recording, in order to improve the recognition performance. It was also discovered that linguists and social scientists can mean very different things when they talk about ‘annotation’.
A last component of the chain was also considered: creating a community to crowdsource the transcription of an interview collection. The sensational success of crowdsourcing personal written documents promises good results as long as the workflow is arranged properly. The platform Crowdflower could provide such a structure. With such projects, there are advantages and disadvantages when we compare dedicated platforms such as Crowdflower or Zooniverse with using our own platforms for crowdsourced projects. Dedicated platforms provide lots of functionality for building and maintaining a community of volunteers, but allow researchers only limited control over the data and software hosted on the platform. Using our own websites to carry out such projects would require lots of improvements in the user interfaces, and lots of effort to reach people and keep them involved.
Of course there were also undercurrents of scepticism, which can ‘spoil the party’, but they deserve a prominent role in the assessment of the potential. These refer to the limits of the efficiency of customizing tools that are created by scholars with no commercial interest and who will eventually retire or change jobs.
Another objection was to the top-down approach, the idea that there is a chain and that by customizing existing tools that were created for other purposes, you can cater for a variety of scholars. An alternative would be choosing one discipline, observing all practices attentively, and designing the best tool or tools to fit these practices.
These objections warn against setting no limits to the customization and against presenting the chain as a service to all scholars that will be maintained eternally. But academics are not eternal; they are mortal creatures who are supposed to produce new knowledge, not services. On the other hand, these types of arguments can also paralyze creativity and enthusiasm, and the will to collaborate for a common goal. The ideal setting for creating optimal services in a non-commercial environment will probably remain a dream. So to push the further development of open source resources we are bound to reach compromises and to take small steps.
The setting in Arezzo was perfect. A mix of nationalities, generations and disciplines engaged in opening up stories about ordinary people, and last but not least, a warm and thoughtful reception by our hosts Silvia Calamai, Francesca Biliotti, Simona Matteini and Caterina Pesce.
For people heading to Arezzo this summer: try La Lancia d’Oro and l’Agania. Readers who want to know more about Oral History and Technology can take a look at the Oral History website curated by Arjan van Hessen and Henk van de Heuvel.
If you are interested in the progress of our effort to create a transcription chain, or are willing to share your experiences with trying out the tools mentioned in this blog, this is the place to be.
On February 6 and 7 CLARIAH WP3 organized a workshop to discuss the application of Linked Data for linguistic research. The workshop, which went under the appropriate acronym of LD4LR, invited presentations from a number of foreign experts as well as a number of representatives from CLARIN centers that had gained some experience using Linked Data in their projects. The workshop concentrated on the perspective of the linguistic researcher who is increasingly confronted with all kinds of information about Linked Data and who needs to know what Linked Data can bring to her research. A number of prominent Dutch linguists were invited to present their current research topics, after which our experts could make suggestions on how to apply Linked Data paradigms to the researcher’s benefit. The invited experts, who are active linguists as well as Linked Data experts, presented their efforts with Linked Data in the fields of Lexica, Phonetics and Treebanks.
Overall there was sufficient time for good discussions, where the experts tried to avoid too specific terminology and concentrate on user needs. In the round-up summaries from day one and two Sjef Barbiers and Jan Odijk concluded that although interesting things happen with Linked Data in linguistics, it seems not immediately useable for the end-user researcher unless they themselves are very familiar with Linked Data as the invited experts are. To make the potential of Linked Data use benefit a broader group of linguists, we need a better bridge between technologists and researchers. Dedicated pilots in WP3 should stimulate investigation of the usefulness of Linked Data application to different types of linguistic research, esp. Lexical resources (DUELME). From a data provisioning perspective, the benefits of Linked Data for interoperability purposes are clear.
Monday afternoon, 6 February 2017
|13:00 – 13:15||Welcome, (Why this Workshop)||Daan Broeder|
|13:15 – 14:00||Why should I use LD for my research?, LD in Comparative Syntax||Nicoline van der Sijs|
|14:00 – 15:30||Broad overview: What kind of (L)L(O)D is available? What linguistic research has been done using it?|
|14:00 – 14:30||John McCrae|
|14:30 – 15:00||Steven Moran|
|15:00 – 15:30||Giuseppe Celano|
|15:30 – 16:00||break|
|16:00 – 17:30||Experiences of CLARIAH/CLARIN centers|
|16:00 – 16:30||Antske Fokkens, Willem van Hage|
|16:30 – 17:00||Matej Durco|
|17:00 – 17:30||Thomas Eckart|
|17:30 – 17:15||Wrap-up day 1||Sjef Barbiers|
Tuesday, 7 February 2017
|09:00 – 09:15||Outlook Day 2||Menzo Windhouwer|
|09:15 – 11:00||Linguistic research case studies|
|09:15 – 09:45||Introduction||Marjo van Koppen|
|09:45 – 11:00||How would you use LD for this research? Expert responses||John McCrae|
|11:00 – 11:30||break|
|11:30 – 12:00||Dieter van Uytvanck|
|12:00 – 13:00||lunch|
|13:00 – 13:30||Linked Data opportunities & limitations||Daan Broeder|
|13:30 – 14:00||Conclusions||Jan Odijk|
On 27 October 2016, the University of Amsterdam opened its doors to The Humanities And Technology Camp (THATCamp).
In recent years the THATCamp formula has crossed the Atlantic and spread over Europe. At THATCamp Amsterdam we came to fully understand the reason for THATCamp’s success: THATCamp is a playful, informal and fun event where programmers and humanities scholars are able to meet, learn about each other's work, toy around with different types of software, and make plans for collaborative projects in the future.
At THATCamp Amsterdam topics ranged widely: from the web’s unboundedness to the use of crowdsourcing in research, from the spread of cinemas in the Netherlands to the role of machines on the work floor. Linked Open Data practitioners exchanged working techniques, while Art Historians explored best computational research practices and an Amsterdam historical GIS hotspot took shape. In between there was coffee, salad and "broodjes", and by the end of the day new plans had emerged for collaborative work on Amsterdam’s Creative Industries, from various perspectives and on multiple scales.
For anyone organizing a THATCamp, the catch in the formula is that THATCamp does not really want to be organized top-down. As can be read on the official website, THATCamp is an "unconference": it is participatory ("there are no spectators at a THATCamp"), informal (there are "no lengthy proposals, papers, presentations"), productive (the focus is on "collegial work or free-form discussion"), flat-structured ("non-hierarchical"), and crucially bottom-up: at THATCamp, the program is created by all participants together, "on the spot" as part of a collective voting session.
For the record: we need not have worried. THATCamp recommends avoiding web-based technology to facilitate the voting, arguing that “the in-person method works well and is fun.” In Amsterdam, this participatory, personal approach of the first session resonated well with the general enthusiastic and constructive attitude of the THATCamp participants. As it turns out, a small collection of post-its, clothespegs, a few sheets of paper and a large dose of enthusiasm and curiosity may just be the perfect toolkit to start a day of collectively exploring the intersections of humanities scholarship and technology.
THATCamp Amsterdam was hosted by the research project Creative Amsterdam: an E-Humanities Perspective (CREATE), at the Amsterdam Centre for Cultural Heritage and Identity. An impression of THATCamp Amsterdam, including a list of session proposals, may be found on the THATCamp Amsterdam webpage and the CREATE blog.
For more information on other events and research projects carried out within the CREATE Program, please visit the CREATE page.
It all seemed rather funny to them, until the very moment they laid eyes upon the prison block. As ‘Team Clariah’ Marieke van Erp (VU, WP3) and Richard Zijdeman (IISG, WP4) participated in the National Library's HackaLOD on 11-12 November. Alongside seven other teams they faced the challenge of building a cool (prototype) application using Linked Open Data made especially available for this event, by the National Library and Heritage partners. It had to be done within 24 hours… Inside a former prison… Here’s their account of the event.
We set out on Friday, somewhat dispirited as our third teammate Melvin Wevers (UU) had been caught out by a cold. Upon arrival, it turned out we had two cells: one for hacking and one for sleeping (well, more like three hours of tossing and turning). As you'd expect, the cells were not exactly cosy, but the organisers had provided goodie bags whose contents were put to good use, even for a midnight Jaw Harp concert.
With that, and our pre-set plan to tell stories around buildings, we set out to build our killer app. We found several datasets that contain information about buildings. The BAG, for example, contains addresses, geo-coordinates and information about how a building is used (as a shop or a gathering place) and 'mutations' (things that happened to the building). However, what it doesn't contain is building names (for example Rijksmuseum or Wolvenburg), which are contained in the Rijksmonumenten dataset. The Rijksmonumenten dataset in turn doesn't contain addresses, but as both contain geo-coordinates, they can be linked. Yay for Linked Data!
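The linking step can be sketched as a nearest-neighbour match on coordinates: each address record is paired with the closest named monument, provided it lies within a small distance threshold. This is only an illustration of the idea and not the code written at the hackathon, which is not shown in this post. The record layout, the 25-metre threshold, the function names and the sample coordinates are all assumptions invented for the example.

```python
import math
from typing import Dict, List, Tuple


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 points."""
    radius_m = 6_371_000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def link_by_location(addresses: List[Dict], monuments: List[Dict],
                     max_distance_m: float = 25.0) -> List[Tuple[str, str]]:
    """Pair each address with the nearest monument within max_distance_m."""
    links = []
    for addr in addresses:
        best, best_distance = None, max_distance_m
        for mon in monuments:
            d = haversine_m(addr["lat"], addr["lon"], mon["lat"], mon["lon"])
            if d <= best_distance:
                best, best_distance = mon, d
        if best is not None:
            links.append((addr["address"], best["name"]))
    return links


# Invented records for illustration; not real BAG or Rijksmonumenten data.
example_addresses = [
    {"address": "Example Street 1", "lat": 52.37000, "lon": 4.90000},
    {"address": "Example Street 99", "lat": 52.00000, "lon": 5.00000},
]
example_monuments = [
    {"name": "Example Monument A", "lat": 52.37005, "lon": 4.90005},
    {"name": "Example Monument B", "lat": 52.38000, "lon": 4.90000},
]
```

Here `link_by_location(example_addresses, example_monuments)` pairs "Example Street 1" with "Example Monument A", which lies a few metres away, and leaves "Example Street 99" unmatched. The threshold is a judgment call: too small and coordinates recorded slightly differently in the two datasets fail to match, too large and neighbouring buildings get confused.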
To tell the stories, we wanted to find some more information in the National Library's newspaper collection. With some help from other hackers we managed to efficiently bring up news articles that mention a particular location. With some manual analysis, we found, for example, that for Kloveniersburgwal 73 there was a steady stream of ads asking for ‘decent’ kitchen maids up until 1890, followed by a sudden spike in ads announcing real estate. It turns out a notary had moved in, for whom another (not linked) dataset could also provide a marriage license, confirmed by a wedding ad in the newspaper. These sorts of stories can give us more insight into what happened in a particular building at a given time.
We have made some steps in starting to analyse these ads automatically to detect these changes in order to automatically generate timelines for locations, but we didn't get that done in 24 hours. However, the audience was sufficiently pleased with our idea for us to win the audience award! (Admittedly to our great surprise, as the other teams' ideas were all really awesome as well). We’re now looking for funding to complete the prototype.
In summary, it was all great fun, not least due to the great organisation by the National Library as well as the nice ‘bonding’ atmosphere among the teams. So, our lessons learnt:
- prison food is really not that bad (and there was lots of it)
- 24 hours of hacking is heaps of fun
- the data always turn out to behave differently from what you'd expect
- isolated from the daily routine, events like these prove crucial to foster new ideas and relations, in order to keep the field in motion.
(by Marieke van Erp)
This year, the 15th International Semantic Web Conference took place in Kobe, Japan. The conference itself was 3 days with 3 parallel sessions as well as a 3-hour poster and demo session one evening. In the two days prior to the main conference, 5 tutorials and 16 workshops took place.
For NLP aficionados there was the LD4IE (Linked Data for Information Extraction) workshop which I attended on Tuesday morning, the NLP&DBpedia workshop that I co-organised on Tuesday afternoon, the keynote by Kathleen McKeown (Columbia University) on Wednesday and the NLP session in the main conference on Friday. But there were other NLP papers dispersed throughout the conference programme.
For the CLARIAH community some of the work McKeown presented on computational analysis of novels is probably most relevant. It was also nice to see that more research is moving towards event extraction, for example in the work of Valentina Presutti and Aldo Gangemi (presented at the LD4IE workshop). They presented a new resource called Framester that links up all types of resources such as FrameNet, VerbNet and DBpedia to help describe events. New at the conference was the journal papers track, where I got to present our work on building Event-centric Knowledge Graphs [slides] [paper] to a pretty big room.
Sentiment analysis was also a hot topic, with several interesting papers such as On the Role of Semantics for Detecting pro-ISIS Stances on Social Media by Hassan Saif, Miriam Fernandez, Matthew Rowe and Harith Alani and A Replication Study of the Top Performing Systems in SemEval Twitter Sentiment Analysis by Efstratios Sygkounas, Giuseppe Rizzo and Raphaël Troncy. Incidentally, the last paper was one of only two replication papers in the conference.
There weren’t that many papers this year dealing with humanities research questions. Next year’s conference will take place in Vienna; perhaps CLARIAH can help change that?