Today, many scientific fields can be described as data-intensive disciplines, which turn raw data into information and then knowledge. If this sounds familiar, it's because it echoes the late and influential computer scientist Jim Gray's vision of the fourth research paradigm. Gray divided the evolution of science into four periods, or paradigms: a thousand years ago science was experimental in nature, a few hundred years ago it became theoretical, a few decades ago it turned computational, and today it is data-driven. To address this data deluge, roughly 1.2 zettabytes each year, researchers rely on e-science tools for collaboration, federation, analysis, and exploration. If 11 ounces of coffee equaled one gigabyte, a zettabyte would have the same volume as the Great Wall of China.
This article was originally published in International Science Grid This Week as "Enabling knowledge creation in data-driven science" (http://www.isgtw.org/feature/enabling-knowledge-creation-data-driven-science). iSGTW is an international weekly online publication that covers distributed computing and the research it enables.
The volume of data is now so great that journals such as Neuroscience have stopped accepting supplementary files with research manuscripts in order to keep peer review manageable. In response, some groups are building infrastructures and software set to radically transform scientific publishing, a process that has changed little in centuries.
Research publishing 2.0
A number of scientific institutes, European Commission-funded projects, and research communities are working to establish common data policies and open-access infrastructures that make research data more searchable, shareable, and citable. The life sciences, meanwhile, are exploring data analysis and publishing approaches that move the computer to the data rather than the data to the computer.
The GigaScience journal is creating an open-access data platform that combines software workflows, databases, and cloud computing to make all stages of scientific research computable, a first for genomics and biomedical data. The journal, a collaboration between BGI in Shenzhen, China, and the publisher BioMed Central, aims to turn research papers into executable data objects. With this, a researcher could in theory spin a separate methods publication out of a research paper, with its own Digital Object Identifier (DOI) that can be indexed and cited.
Scott Edmunds, an editor of the GigaScience journal, says, “You don’t need methods sections in a paper anymore as methods can be computed, making it easier for reviewers to check data.” To enable this new journal platform, GigaScience is using an open-source workflow system called Galaxy, which is run by a team at Penn State University in Pennsylvania, US. For those in the know, this is the same team that was involved in the data release of the ENCODE project in September 2012, when ENCODE simultaneously published 30 research papers about the function of the human genome. Peer review of the project’s findings was made easier with virtual machines containing terabytes of input data, code, scripts, processing steps, and outputs.
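To give a concrete feel for what “computable methods” can look like, here is a minimal sketch using BioBlend, an open-source Python client for the Galaxy API. It assumes a reachable Galaxy server; the URL, API key, file name, and workflow ID are placeholders, not details of the GigaScience or ENCODE setups.

```python
# Minimal sketch: driving a Galaxy analysis programmatically so a reviewer can re-run it.
# The URL, API key, file name, and workflow ID are illustrative placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

# Create a history to hold the inputs and outputs of one analysis run.
history = gi.histories.create_history(name="reproducible-methods-demo")

# Upload the raw data that a traditional methods section would only describe in prose.
upload = gi.tools.upload_file("reads.fastq", history["id"])
dataset_id = upload["outputs"][0]["id"]

# Invoke a previously shared workflow, so the exact analysis steps are repeatable.
invocation = gi.workflows.invoke_workflow(
    workflow_id="WORKFLOW_ID",
    inputs={"0": {"src": "hda", "id": dataset_id}},
    history_id=history["id"],
)
print("Workflow invocation:", invocation["id"])
```

Because the workflow, its parameters, and the input data all live on the server, a reviewer in principle only needs the identifiers above to reproduce the published result.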
“ENCODE is state of the art,” says Edmunds. “They set everything up in the Amazon cloud. James Taylor of Penn State University, who set up their virtual peer-review environment, paid for the Amazon charges himself. They came to $5,000, which, considering the size of the data and the number of people, is quite cheap.”
If we build it, will they come?
Last year, the GigaScience journal launched its first citable DOI dataset: the genome of a nasty E. coli strain. The information was released on Twitter under a Creative Commons license and in a pre-publication citable format. Upon release, researchers around the world started producing their own assemblies and annotations and sharing the data on Twitter, with some dubbing it the first ‘Tweenome’.
Currently, the journal stores about 20 terabytes of public data on its servers, and another 5-10 terabytes are being prepared for release. It has already released a number of peer-reviewed papers and is hard at work finalizing its data platform, which should enable analysis and recreation of experiments on Galaxy in the next few months.
“When we’re confident it’s shiny and ready for prime time we’ll publish something. I’d like to say our platform will answer everything, but it’s really just another step in the right direction,” says Edmunds. “We have to see if people actually use the infrastructure. People have been used to publishing research papers in a certain way for three centuries. Some people will get our platform and use it immediately. But, I think it will take a while for the rest of the community to get used to it.”
Flipping publishing on its head
Data can also be represented as nanopublications, an approach discussed in a 2011 Nature article. A nanopublication is the smallest unit of publication: a single machine-readable assertion associating two concepts, together with its supporting metadata and its own universally unique identifier (UUID). GigaScience’s data may be included in this approach.
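As a rough illustration of the structure, the sketch below builds a single assertion with attached provenance as named graphs, using the Python rdflib library. The example statement, researcher, and URIs are invented for illustration and do not come from nanopub.org or GigaScience.

```python
# Sketch of a nanopublication-style structure: one assertion plus provenance,
# stored as named graphs. All URIs and the example statement are placeholders.
import uuid
from rdflib import ConjunctiveGraph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")
PROV = Namespace("http://www.w3.org/ns/prov#")
NP = Namespace("http://www.nanopub.org/nschema#")

# Give the nanopublication its own universally unique identifier (UUID).
np_uri = URIRef(f"http://example.org/nanopub/{uuid.uuid4()}")
assertion_uri = URIRef(str(np_uri) + "#assertion")
provenance_uri = URIRef(str(np_uri) + "#provenance")

store = ConjunctiveGraph()

# The assertion graph: a single machine-readable statement linking two concepts.
assertion = store.get_context(assertion_uri)
assertion.add((EX["malaria"], EX["isTransmittedBy"], EX["Anopheles_mosquito"]))

# The provenance graph: who asserted it and when.
provenance = store.get_context(provenance_uri)
provenance.add((assertion_uri, PROV.wasAttributedTo, EX["some_researcher"]))
provenance.add((assertion_uri, PROV.generatedAtTime,
                Literal("2012-11-01T00:00:00Z", datatype=XSD.dateTime)))

# The head graph ties the pieces together as one citable nanopublication.
head = store.get_context(np_uri)
head.add((np_uri, RDF.type, NP.Nanopublication))
head.add((np_uri, NP.hasAssertion, assertion_uri))
head.add((np_uri, NP.hasProvenance, provenance_uri))

print(store.serialize(format="trig"))
```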
“Converting data from classical relational databases or spreadsheets into nanopublications is relatively easy, while harvesting them from classical narrative is the main issue,” says Barend Mons, scientific director of the Netherlands Bioinformatics Centre.
There are dozens of projects around the world converting major datasets, creating new semantic software and new terminology. They include nanopub.org, which helps researchers create nanopublications of their data, and the Phortos Group, a collaboration of companies and researchers providing services to owners of big data. The group aims to develop applications that find associations or semantic relationships across nanopublications, a task too large for any one human brain to tackle alone.
Mons says research in his biosemantics group has already inferred new discoveries from what they refer to as the 'explicitome', or all the explicit assertions they can find. They estimate that the explicitome of the biomedical life sciences currently consists of 100 trillion nanopublications.
By grouping similar assertions under a summarized or 'cardinal' assertion, those 100 trillion nanopublications can be reduced to 200 billion unique assertions, and further to fewer than two million concepts, a manageable amount of data, says Mons.
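The reduction Mons describes is essentially de-duplication: many nanopublications that make the same subject-predicate-object claim can be collapsed into one cardinal assertion that records how many sources support it. A toy sketch of that grouping step, with made-up assertions, might look like this:

```python
# Toy sketch of collapsing duplicate assertions into "cardinal" assertions.
# The assertions and source labels are made up for illustration.
from collections import Counter

# Each nanopublication carries one (subject, predicate, object) assertion plus its source.
nanopubs = [
    (("BRCA1", "interacts_with", "BARD1"), "paper-A"),
    (("BRCA1", "interacts_with", "BARD1"), "paper-B"),
    (("TP53", "regulates", "CDKN1A"), "database-X"),
    (("BRCA1", "interacts_with", "BARD1"), "database-X"),
]

# Group identical assertions; the count records how many nanopublications support each one.
cardinal = Counter(assertion for assertion, _source in nanopubs)

for (subj, pred, obj), support in cardinal.items():
    print(f"{subj} {pred} {obj}  (supported by {support} nanopublications)")
```

The same idea, applied at the scale Mons describes, is what shrinks trillions of redundant statements down to a set of unique assertions over a few million concepts.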
“From the explicitome we can infer novel protein-protein interactions, for example, and there is more to come. Our work may go beyond just dealing with big data, as we advance e-science publishing (or publishing 2.0) and also evidence-aware computer reasoning. We’re moving into studies on the very nature of human knowledge discovery, and this will give some very exciting insights,” says Mons.
A distributed community of communities
A bigger question is whether a software approach that serves one research community can be transferred to another. The ScienceSoft project is trying to address this question by building a catalogue of software products for all research communities. But it is also much more than that, says Alberto Di Meglio, project director of the European Middleware Initiative.
ScienceSoft aims to be a distributed community of communities that use scientific software, addressing issues such as software identification, the long-term preservation of software information, and relationships between software, publications, and data sets.
Information that is uniquely identifiable can serve as a knowledge base across different research areas and improve the reliability of peer review.
“We’re not the only ones thinking of unique identifiers for digital objects, but as far as we know, most of this research has been done on individual categories of objects,” says Di Meglio. “Today, the way data sets and software are referred to in research papers does not yet allow other people to reproduce the results in a consistent and easily accessible way. Creating links across knowledge is essential. Science today is distributed and the advantages of cross-disciplinary research are becoming apparent.”
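One lightweight way to make a software release or data set uniquely identifiable, independent of where it happens to be hosted, is to derive an identifier from its content, for example a cryptographic hash, and record it alongside the identifiers of the publications and data it relates to. The sketch below is hypothetical, not part of ScienceSoft; the file name and DOIs are placeholders.

```python
# Hypothetical sketch: derive a content-based identifier for a software artifact
# and record its links to related publications and data sets.
# The file name and DOIs are placeholders, not real identifiers.
import hashlib
import json

def content_identifier(path: str) -> str:
    """Return a SHA-256 digest of the file, usable as a location-independent ID."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

# Create a small placeholder file so the example runs end to end.
with open("assembler-1.0.tar.gz", "wb") as f:
    f.write(b"placeholder release tarball contents")

record = {
    "software_id": content_identifier("assembler-1.0.tar.gz"),
    "described_in": "doi:10.0000/example-paper",     # placeholder publication DOI
    "used_dataset": "doi:10.0000/example-dataset",   # placeholder data set DOI
}

print(json.dumps(record, indent=2))
```

Because the identifier depends only on the bytes of the release, anyone who obtains a copy can verify they are citing, reviewing, or re-running exactly the same software the paper used.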
In one example, Mike Conlon, principal investigator of the VIVO project, gave a keynote speech at the 8th IEEE International Conference on eScience in 2012 about new scientific results made possible by linking two completely different research areas.
Di Meglio says the vision is to create a semantic system that will help researchers find information quickly, spot similarities across disciplines, avoid errors, and increase collaboration. “It’s also important for decision makers too because they can understand what areas require more attention and funding, maybe creating a more measurable relationship between research projects and funding agencies.”
The ScienceSoft project is looking to include various science disciplines in this discussion and it’s organizing a workshop in January 2013 on the topic of unique identifiers for software digital objects.
About the author:
Adrian Giordani has a Masters in Science Communication from Imperial College London, where he was also the Editor-in-Chief of I, Science magazine. He was a science journalist and Interim Editor-in-Chief at the publication International Science Grid This Week, based at CERN in Geneva, Switzerland. He writes about health, big data, software, supercomputing, and all forms of large-scale computing. You can follow him on Twitter (@Speakster).