Tools-to-data: a data-safe research environment
Tools-to-data allows researchers to work with data in a safe environment that prevents data leaks. Instead of sending a copy of the data to the researchers to work with, the researchers send their tool to the environment and receive the results.
Tools-to-data allows researchers to work with data that are currently often unavailable to them, e.g. publications that are protected by copyright law, or datasets with privacy issues. This means that data providers can offer more of their valuable collections to researchers by expanding their services with a tools-to-data environment. In short: tools-to-data can help improve the results of humanities research by making more data available.
Tools-to-data is a collaboration between the KB, National Library of the Netherlands, two researchers from the CLARIAH community, and SURF.
Aims of the proof-of-concept
The proof-of-concept was built to demonstrate the viability of a tools-to-data environment. Viability was considered from three perspectives: 1) researchers: can they conduct their research without receiving a copy of the data and without even seeing the data? 2) data providers: is the safety of the data sufficiently guaranteed, and which technical and organisational measures are needed? 3) CLARIAH: can the tools-to-data environment comply with the requirements of the CLARIAH infrastructure?
The proof-of-concept was built to explore the strictest variant of tools-to-data, where the researchers cannot see the data at all and only receive the results of their tools.
Tools-to-data: what is it about?
Humanities researchers often use text and data mining techniques as they try to discover patterns in the data. They typically use the data in bulk by applying computational analysis tools. Usually, researchers receive a copy of the data from a data provider, e.g. KB. They store the copy on their own laptop or server to work with. However, this has two disadvantages.
Firstly, the data providers cannot send a copy of all their data because of copyright law, contracts, privacy issues, etc. Researchers are often pragmatic and restrict their research to data that may be copied to them. This means that there may be large gaps in their sources, e.g. linguistic research that excludes recent publications because of copyright issues.
Secondly, copying the data is cumbersome, impractical and error-prone. It burdens the researcher with storing and managing the data. Moreover, it can lead to many copies of the data that are hard to keep track of.
Tools-to-data turns things around by bringing the tool to the data. The data remain in a safe environment, where the tool runs as well. Neither the researcher nor the tool can copy the data. The researcher only receives the results of their tool. In the strictest variant, the researcher cannot see the data at all.
Of course, the safety of the data must be sufficiently guaranteed, through both technical and organisational measures. Technically, the environment must prevent copying of the data by having no internet connection while the tool runs on the data. Organisationally, data providers need to be in charge of the environment. They can check the results of the tools for data leakage before making them available to the researcher.
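The interaction described above can be illustrated with a minimal Python sketch. It is not the proof-of-concept's implementation: the class SafeEnvironment, its methods, the two example tools and the five-word verbatim threshold are all hypothetical names and choices made for illustration. The sketch only models the organisational side (the tool runs where the data are, and results are held until the data provider's check passes). Network isolation would have to be enforced by the infrastructure and is not modelled here, and a real leakage check would need to be far more thorough than this one.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return all runs of n consecutive words in text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


@dataclass
class SafeEnvironment:
    """Hypothetical stand-in for the environment run by the data provider."""

    documents: list[str]
    max_verbatim_words: int = 5  # assumed threshold, chosen for illustration
    _pending: dict[int, str] = field(default_factory=dict, init=False)
    _next_id: int = field(default=0, init=False)

    def submit(self, tool: Callable[[Iterable[str]], str]) -> int:
        """Run the researcher's tool on the data and hold the result back."""
        self._next_id += 1
        self._pending[self._next_id] = tool(iter(self.documents))
        return self._next_id

    def leaks_data(self, result: str) -> bool:
        """Naive check: does the result repeat a long verbatim run of words?"""
        n = self.max_verbatim_words
        result_ngrams = word_ngrams(result, n)
        return any(result_ngrams & word_ngrams(doc, n) for doc in self.documents)

    def release(self, job_id: int) -> Optional[str]:
        """Provider's decision: hand over the result only if the check passes."""
        result = self._pending.pop(job_id)
        return None if self.leaks_data(result) else result


def count_word(documents: Iterable[str]) -> str:
    """Example tool: an aggregate statistic, no source text in the output."""
    total = sum(doc.lower().split().count("data") for doc in documents)
    return f"occurrences of 'data': {total}"


def copy_everything(documents: Iterable[str]) -> str:
    """Example of a tool that tries to export the data verbatim."""
    return "\n".join(documents)


env = SafeEnvironment(documents=[
    "the data remain in a safe environment at all times",
    "researchers send their tool to the data",
])
released = env.release(env.submit(count_word))
blocked = env.release(env.submit(copy_everything))
print(released)
print(blocked)
```

In this sketch the researcher only ever receives the return value of release, so a tool that merely aggregates gets its result back, while one that reproduces the source text is withheld. In practice the check before release could also be a manual review by the data provider.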
The proof-of-concept was developed by SURF. They combined and adapted existing components of their environment (the SURF Research Cloud). No effort was spent on building a dedicated user interface.
There was a close collaboration with two researchers, who used the proof-of-concept for real research cases. They described their requirements in great detail and tested several iterations of the proof-of-concept. Their involvement was crucial to the project, because it allowed the proof-of-concept to be built and tested under realistic conditions.
The simple workflow scheme below shows how the researcher and the data provider interact with the environment. The blue parts of the workflow scheme are implemented in the proof-of-concept. The grey ones are not yet implemented, but they are covered in the requirements.
The proof-of-concept demonstrated that tools-to-data is a viable concept that deserves to be developed into a full-fledged environment. The researchers’ experience showed that they can do meaningful research without being able to see the data, and the KB concluded that it remains in control of the data. Tools-to-data allows the KB to make more digital collections available for research, opening up its rich collections without harming the interests of copyright holders and other parties.
Besides the proof-of-concept, the project also delivered a description of the requirements for tools-to-data. These describe what is needed for a full-fledged tools-to-data environment, beyond the proof-of-concept.
The proof-of-concept will be further developed in the SANE project. This project will build a secure data environment for social sciences and humanities. The researcher can analyse sensitive data, but the data provider retains full control. SANE is a collaboration between CLARIAH, ODISSEI and SURF. It will come in two varieties: Tinker and Blind. In Tinker SANE, the researcher can see and manipulate the data in a secure environment. In Blind SANE, the researcher submits an algorithm without being able to see the data. Our tools-to-data is the precursor of Blind SANE.
Tools-to-data can be useful for all kinds of data: not just data for the humanities, but also data for the social sciences, or in fact any kind of data that need a safe environment.