From 23 to 28 May the biennial Language Resources and Evaluation Conference (LREC) took place in Portorož, Slovenia. LREC is a large conference in our field, covering all aspects of language technology. About 1200 people attended (all quite happy that the WiFi worked!) and nearly 750 papers were presented (4 parallel oral sessions and 5 poster sessions throughout the conference). So there was plenty for everyone, and naturally this post can only reflect the papers that caught my attention and what I think might be of interest to you.
First of all: CLARIAH and CLARIN ERIC were well represented:
Besides a fair amount of attention to sign language (sessions P15 and O30) and less-resourced languages (session P42), there was also attention to historical language use, such as POS-tagging for Historical Dutch by Dieuwke Hupkes and Rens Bod. What I found really nifty is that they use word alignments between contemporary Dutch (for which we have lots of language tools) and historical Dutch to assign the correct POS-tags.
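The core idea of tag projection can be illustrated in a few lines. This is a minimal sketch with made-up example data, not the authors' actual code: given word alignments between a POS-tagged contemporary sentence and its historical counterpart, each historical token simply inherits the tag of the contemporary token it aligns to.

```python
# Toy illustration of POS-tag projection via word alignments.
# The sentences, tags, and the 1-to-1 alignment below are assumptions
# for demonstration; real alignments can be many-to-many and noisy.

contemporary = [("ik", "PRON"), ("zag", "VERB"), ("hem", "PRON")]
historical = ["ick", "sach", "hem"]
# alignment: historical token index -> contemporary token index
alignment = {0: 0, 1: 1, 2: 2}

def project_tags(historical, contemporary, alignment, default="UNK"):
    """Give each historical token the POS tag of its aligned contemporary token."""
    tagged = []
    for i, token in enumerate(historical):
        j = alignment.get(i)
        tag = contemporary[j][1] if j is not None else default
        tagged.append((token, tag))
    return tagged

print(project_tags(historical, contemporary, alignment))
# [('ick', 'PRON'), ('sach', 'VERB'), ('hem', 'PRON')]
```

Unaligned historical tokens fall back to a default tag here; in practice that is exactly where such a method needs a smarter back-off.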
There was also a poster presentation by Maria Sukhareva and Christian Chiarcos on Combining Ontologies and Neural Networks for Analyzing Historical Language Varieties. A Case Study in Middle Low German. Again, projections are used (something I never had to worry about when working on contemporary text), and I like that it combines machine learning with background information from an ontology to improve performance.
There were lots of interesting resources and frameworks for publishing linguistic resources presented. One place where we can learn from (and tag onto the work of) our colleagues from the Semantic Web is the Linguistic Linked Open Data Cloud, where linguistic resources can be stored in a uniform format, which enables easier (though not yet entirely painless) reuse.
Corpus building is a time-consuming task, so I also really liked The Royal Society Corpus: From Uncharted Data to Corpus poster. Whilst the Royal Society dataset interests me anyway, they adopted an approach to building the corpus based on agile software development. Whilst this may not be suitable for every corpus-building effort, it may be worthwhile to take notice of it and see where we can make our own approaches more flexible, publishing data faster and using feedback loops to improve it.
Then there were also several datasets covering non-English languages, such as the Royal Library 1 Million Captioned Dutch Newspaper Images by Desmond Elliott and Martijn Kleppe; An Open Corpus for Named Entity Recognition in Historic Newspapers by Clemens Neudecker, containing Dutch, French and German newspaper text including historical spellings; and Publishing the Trove Newspaper Corpus by Steve Cassidy, on the corpus derived from the National Library of Australia's digital archive of newspaper text.
Here, I should also mention the 2nd keynote by Ryan McDonald from Google on "The Language Resource Spectrum: A perspective from Google". In his talk he presented some experiments done at Google on different NLP tasks to figure out whether to put more effort (= money) into annotated data or into fancier language models. Whilst some of the results were not that surprising, I think it's an interesting question to ask, and we don't always ask ourselves this as researchers because we are 'used to using method X or Y' (at least in my limited experience).
Unfortunately, the poster didn't make it to Slovenia, but the paper on Complementarity, F-score, and NLP Evaluation by Leon Derczynski raises some interesting issues on how we compare systems; when two systems reach the same F-score, for example, it doesn't mean they perform the same on all aspects of the problem.
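A toy example makes the point concrete. The data below is entirely made up for illustration (it is not from the paper): two systems obtain exactly the same F1 against a gold set while getting completely disjoint sets of instances right, so aggregate F-score alone hides their complementarity.

```python
# Two hypothetical systems with identical F1 but no overlap in what they
# get right. Gold and predictions are assumed example sets.

def f1(tp, fp, fn):
    """Standard F1 from true positives, false positives, false negatives."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

def score(pred, gold):
    tp = len(pred & gold)
    return f1(tp, len(pred - gold), len(gold - pred))

gold = {"e1", "e2", "e3", "e4"}
sys_a = {"e1", "e2", "e5"}   # finds e1, e2; misses e3, e4
sys_b = {"e3", "e4", "e6"}   # finds e3, e4; misses e1, e2

print(score(sys_a, gold), score(sys_b, gold))  # identical F1 (4/7 each)...
print(sys_a & gold, sys_b & gold)              # ...but disjoint correct sets
```

An ensemble of these two systems would recover the whole gold set, which a single F-score comparison would never suggest.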
<shameless plug>I also got to present our paper on Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job where we looked at the different characteristics of different entity linking benchmark datasets and found that there is still a fair bit of work to do before we are testing different dimensions of the problem.</shameless plug>
All in all, LREC was yet again a great, varied three-day whirlwind of what's hot and happening in language technology in Europe (and a little bit beyond). After having gotten some sleep and caught up on the papers I didn't get to see, I'm looking forward to LREC 2018!
Marieke van Erp
On Friday July 5, 2013, I visited the workshop Research Infrastructures towards 2020 organized by the EuroRisNet+ project, in Lisbon, Portugal. I also gave a presentation there on CLARIAH and, as requested by the organizers, the organizational challenges it has experienced and is still to face in the context of the National Roadmap for Large Scale Infrastructures.
Interest in this workshop was very high: so high that first a new venue had to be sought to accommodate as many participants as possible, and second, when this venue was also full, it was decided to live-stream the event over the internet (see here for the recording). And all of this while temperatures in Lisbon rose to close to 40 degrees Celsius, and a beach would have been a much more attractive option than a workshop on research infrastructures in a hot and busy city!
The main reason for the large interest can probably be traced to the fact that, for the first time, a call was launched for the Portuguese National Roadmap for Large Scale Infrastructures. But that cannot be the only reason, because there were also many attendees from other countries. It is clear that research infrastructures are “hot”, and I believe that many expect to obtain funding for their work via infrastructure funds. The EuroRisNet+ project made an inventory of research infrastructure projects which contains almost 300 entries, and the MERIL database contains over 300!
The launch of the Portuguese National Roadmap was a bit of a disappointment: the procedure was not very clearly defined, there were no clearly defined criteria for evaluation, and no concrete budget figures were mentioned (these are expected in about three weeks). Portugal will use European FP7 structural funds for this, which implies that the procedure must be finished by the end of this year. The Portuguese CLARIN people (e.g. Antonio Branco) are ready to submit a proposal, and I met some others who will submit a proposal related to DARIAH, so let’s wish them success with their applications!
From the perspective of the Netherlands, two presentations given there are of special importance. The presentation by Philippe Froissard (Deputy Head of Unit, Research Infrastructures, European Commission) sketched the plans for research infrastructures in Horizon 2020, including concrete budget figures. And second, the presentation by Cas Maessen (NWO) sketched the history of the Netherlands National Roadmap, as well as considerations and ongoing discussions about the future of this roadmap. These, and the other presentations from this event, are online and can be found here.
In my presentation, one of the challenges I mentioned had to do with IPR: how can we get easy and legal access to contemporary textual and audio-visual resources that are copyright-protected? Of course, I did not have a solution for this, nor did I expect one from the audience. However, I was pleasantly surprised to find a message in my e-mailbox early in the morning with a link to a speech given by Neelie Kroes at LT-Innovate one week earlier, in which, talking about text and data mining, she states that she is “determined to reform the copyright system to capture the opportunities of the digital age, if necessary including legislative reform”. This is not a solution yet, but at least the problem is being addressed at the highest levels of the European Commission!
Temperatures rose even higher in the weekend after the workshop, so the only rational thing to do was to spend these days on the beach and in the (actually quite cold) Atlantic Ocean.
Below is the text that Stef Scagliola wrote to the developers of Bamboo Dirt, after spending two days browsing through their beautiful registry of digital tools in search of appropriate tools, with tutorials, to use in a third-year bachelor class on digital literacy:
I am an historian exploring the digital humanities agenda at the Erasmus Studio of the Erasmus University in Rotterdam. I am also a member of the Virtual Center of Competence 2 of the DARIAH initiative, and as such I am trying to create support for the development of a portal with digital tools that is suitable for teaching DH to bachelor and master students with a non-computer-science background. I recently talked to a colleague, Christoph Schoch from Wurzburg, and he suggested writing to the developers of Bamboo Dirt to ask them to adjust the structure of the registry to suit the needs of such a course.
My impression is that there is a gap between courses that delve into programming and modelling, and courses that teach the basic terminology of computer and information science. Some lecturers integrate these features into general research methodology of quantitative and qualitative data, or methods of source criticism, but very often it is not part of the curriculum.
In Rotterdam we are developing such a course in the form of an interfaculty minor. I think Bamboo Dirt is a wonderful registry of all possible tools, but it is not suitable for the purpose of conveying basic knowledge of how digital tools work in a teaching environment. What I suggest is an environment with a selection of tools arranged along the sequence of the research process:
- searching archives for suitable data or literature
- processing your own data or reusing data from someone else
- presenting the result of your research
- curating the data for long term preservation (see: http://eprints.eemcs.utwente.nl/20868 )
I would like to be able to select features of tools at the top level of the registry (open access, etc.), not within a specific category, and these would be the ideal criteria for selection:
- open access
- direct relation with research process (this means leaving out everything that has to do with cataloguing, creating online content, archiving, curating)
- availability of a video tutorial
- availability of cleaned data sets that can be downloaded and used in class (variety of sources: text, numerical, audio-visual, photography, social media-data) or links to places where these can be found
- an opening page with a clear overview
- possibility to skip complicated registration and login/password procedures
- possibility to gradually develop best practices page with tips for educators (links to suitable data sets in different languages!)
These insights gradually developed as I tried to select a number of tools that could be integrated into a teaching portal and that would suit our course, by systematically scanning all the categories of the Bamboo Dirt registry. It took me two days, and I only got through the first couple of categories. I realized I was trying to make sense out of a telephone book: in a way, you more or less already have to know what you are looking for. The magnitude of what is available is an obstacle to assessing what the best choice is on the basis of thorough knowledge of the content of each tool. Initially I wanted to create the portal for our course this coming year, with the help and feedback of colleagues from Denmark and Austria, but I gradually came to the conclusion that this is too ambitious within the time frame and with the available means.

In general my impression is that DH is a great field; it attracts enthusiastic people who are willing to share. But its inclusiveness (library studies, archivists, designers, artists, information studies) and democratic nature has a downside, as it creates a deluge of perspectives and tools, and a lack of authority on what yields the best possible result. This need for clarity may be a 'generation' thing, I was born in 1958, but my experience is that many researchers share this 'Alice in Digital Wonderland' sensation: exciting, but disorientating.
Dr. Stef Scagliola
Erasmus University Rotterdam