Metadata in Scientific Publication

Just as a dictionary defines words, metadata is data that describes digital or physical objects. To understand its usefulness, compare metadata with the labels used in Ancient Greece to describe the contents of papyrus scrolls, which were piled onto shelves in large numbers. The label attached to each scroll gave a quick overview of its contents without anyone having to take it out of the pile or unroll it. Such a system was efficient in Ancient Greece, but today the sheer volume of digital data available makes more efficient classification systems essential.


This article is a translation of “Les métadonnées de la publication scientifique” (http://blog.mysciencework.com/2012/11/21/les-metadonnees-de-la-publication-scientifique.html), translated from French into English by Mayte Perea López.

 


 

Using metadata for better identification and classification

We have already discussed metadata in the music industry in an earlier series of articles. On this blog, the subject that particularly interests us is scientific publication. The scientific article is the main tool for communicating the knowledge produced by research, and its distribution and archiving carry significant commercial stakes. Articles exist to be exchanged and shared, and to that end they must be indexed and stored in archives and other computer systems.

In order to foster sharing and facilitate interoperability between different systems, bibliographical standards had to be developed. Most of the digital metadata used today was inspired by the referencing methods and cataloguing standards that existed long before the digital era. To be easily identified and located, each document in a library had to be described on a bibliographical card with fields such as title, author, number of pages and discipline. To meet these needs, a large number of classification and cataloguing standards were created (for instance the Dewey Decimal System, MARC 21 and UNIMARC), but they remain partly incompatible with one another.

Generic bibliographical standards that do not depend on any one scientific discipline make it possible to standardize the metadata associated with scientific publications and to broaden the possibilities for sharing it. In 1995, an international working group called the Dublin Core Metadata Initiative (DCMI), made up of professionals from library and information science, computer science, text tagging, the museum community and other fields, established a set of generic metadata elements to describe digital resources (videos, images, books, websites, etc.). The Dublin Core describes each resource with the following 15 optional fields: Title, Creator/Author, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights. There are other, much more complex standards, for example MARCXML or JATS, which was developed at the U.S. National Library of Medicine and is used by PubMed Central. However, the standards defined by the Dublin Core are by far the most commonly used. Content producers are encouraged to use these standards to describe their products. Metadata is not intended to be read directly by people; it is usually invisible to the user, but it supports services that process scientific documents, for example specialized search engines. The semantic web covers all the practices and standards that aim to enrich the initial data with semantic metadata, producing files better suited to new uses (see Leading the Web to its Full Potential with the Semantic Web).
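To make the 15 fields more concrete, here is a minimal sketch in Python of how a Dublin Core description could be serialized to XML. It uses the element names and namespaces of simple Dublin Core as commonly exchanged in the "oai_dc" format; the function name `build_dc_record` and the sample record are invented for illustration and do not come from any particular publisher or library.

```python
import xml.etree.ElementTree as ET

# Namespaces of simple Dublin Core and of its "oai_dc" container element.
DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# The 15 elements; "creator" is the element name of the Creator/Author field.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_dc_record(fields):
    """Serialize a dict of Dublin Core fields to an XML string.

    Every field is optional and may be repeated, so each value is a list.
    (build_dc_record is a hypothetical helper, not a library function.)
    """
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name in DC_ELEMENTS:  # iterate in a fixed order for stable output
        for value in fields.get(name, []):
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record, invented for illustration only.
example = {
    "title": ["An Example Article"],
    "creator": ["Doe, Jane", "Roe, Richard"],
    "date": ["2012-11-21"],
    "language": ["en"],
    "rights": ["CC BY"],
}
xml_text = build_dc_record(example)
print(xml_text)
```

Because no field is mandatory, a record with only a title and a creator is just as valid as one that fills in all 15 fields, which is part of what makes the format easy to adopt.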

Metadata for Open Access Scientific Publishing

The standards introduced by the Dublin Core are an important step towards unifying the formats used to share descriptive data about digital resources. If every new format is designed to meet specific needs, a question naturally arises: what are those needs in scientific publishing? Scientists and other readers of their publications need, among other things, to quickly find the articles dealing with a given subject and the authors who wrote them. The authors’ institution or research laboratory, the associated rights and the publication date are also potentially useful. Although the majority of publishers have already adopted the Dublin Core format, the way the fields are filled in still varies from one source to another. For example, an author’s family name and given name can be written in several formats (“Family Name, G.N.” or “Given Name FAMILY NAME”). Bibliographical management software is one of the most telling applications, because it centralizes scientific articles coming from various publishers. To obtain a consistent metadata base, however, it is sometimes necessary to correct errors manually.
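The author-name problem can be illustrated with a short Python sketch that brings the two formats mentioned above into a single "Family, Given" form. The rules it applies are assumptions chosen for this example, not part of any standard, and the function name `normalize_author` is hypothetical; real names will defeat such heuristics often enough that manual correction remains necessary.

```python
def normalize_author(raw):
    """Return an author name as 'Family, Given'.

    Assumed heuristics (illustrative only, not a standard):
    - if the string contains a comma, the family name comes first;
    - otherwise, fully upper-case tokens of two or more characters
      are taken to be the family name;
    - otherwise, the last token is taken to be the family name.
    """
    raw = " ".join(raw.split())  # collapse repeated whitespace
    if "," in raw:
        family, given = (part.strip() for part in raw.split(",", 1))
    else:
        tokens = raw.split(" ")
        upper = [t for t in tokens if len(t) > 1 and t.isupper()]
        if upper and len(upper) < len(tokens):
            family = " ".join(upper)
            given = " ".join(t for t in tokens if t not in upper)
        else:
            family, given = tokens[-1], " ".join(tokens[:-1])
    if family.isupper():
        family = family.title()  # "DOE" becomes "Doe"
    return f"{family}, {given}" if given else family


# Hypothetical names, invented for illustration only.
for name in ["Doe, J.", "Jane DOE", "Jane Doe"]:
    print(name, "->", normalize_author(name))
```

Note that the sketch only unifies the order and the capitalization; it cannot decide whether "Doe, J." and "Doe, Jane" are the same person, which is a separate and harder problem.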

 


Publishing scientific articles in Open Access gives them greater visibility, since they are distributed free of charge and can be reached with a simple Internet connection. The Open Archives Initiative (OAI), whose aim is to promote Open Access through the development of interoperability standards, created the OAI-PMH protocol (Open Archives Initiative Protocol for Metadata Harvesting), which facilitates the exchange of information between repositories (archives of scientific publications) and service providers. Service providers are the institutions that put the collected metadata to use, for example search engines like Google Scholar or websites such as the social network MyScienceWork. The OAI-PMH protocol, which is invoked over HTTP, queries article repositories to collect the metadata of scientific documents and, in some cases, to download the text files. Anyone can therefore “harvest”, that is to say collect, the metadata describing the contents of Open Access repositories like PubMed, arXiv or HAL. Several directories (DOAJ, ROAR) list thousands of Open Access repositories. This system makes it possible to access large databases in a fairly short time. In most cases, the metadata provided through OAI-PMH is defined according to the Dublin Core. It is worth noting that Wikipedia is one of the repositories that offer OAI-PMH access to their data.
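As a rough illustration of what harvesting looks like in practice, the Python sketch below builds the URL of an OAI-PMH ListRecords request and extracts titles and creators from a response. The base URL, the sample response and the helper names `list_records_url` and `parse_list_records` are all invented for this example; the response is a hand-written fragment rather than the output of a real repository, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build the URL of an OAI-PMH ListRecords request."""
    if resumption_token:
        # When continuing a harvest, the token replaces the other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        titles = [e.text for e in record.iter(f"{{{DC_NS}}}title")]
        creators = [e.text for e in record.iter(f"{{{DC_NS}}}creator")]
        records.append(
            {"identifier": identifier, "titles": titles, "creators": creators}
        )
    token = root.findtext(f".//{{{OAI_NS}}}resumptionToken")
    return records, (token or None)


# Hand-written sample response, invented for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Article</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

# Hypothetical repository address; a real harvester would fetch this URL.
print(list_records_url("https://repository.example.org/oai"))
records, token = parse_list_records(SAMPLE_RESPONSE)
print(records, token)
```

A real harvester would fetch each URL, parse the response as above, and repeat the request with the returned resumption token until the repository stops providing one.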

Using metadata for scientific data sharing?

The Dublin Core standards are quite simple. They can describe scientific publications regardless of the discipline concerned. As for data sharing, also known as Open Data, additional difficulties related to the diversity of formats make the definition of universal standards much more complex.

Open Data is a concept that is gaining momentum within our institutions and governments. Open Data in science could dramatically change the way research currently works. If all the raw data used by scientists were freely accessible, every actor in society with the necessary skills and knowledge could potentially conduct research on the same data. The scientific community as a whole would benefit from this sharing. It would make it simpler to set up collaborative work on complex problems (see the example of the collaborative mathematics project Polymath). It would open scientific data to people who are today excluded from the system (see the example of the Eurogenes project). Finally, it would strongly favor transparency in the scientific research process. It must be kept in mind, however, that competition between teams and laboratories limits how far data is opened in practice.

Scientific data is obviously very heterogeneous, depending on the discipline and the subject being studied. Today, there are no universal standards for representing scientific data, and such standards will be very difficult to implement. Even the definition of scientific data has not yet been clearly established. Standardizing the formats used, first within one discipline and then across disciplines, would be progress towards that end. The future may well hold new scientific practices, thanks to the progressive release of metadata and scientific data.

 

Many thanks to Iana Atanassova for reviewing this article.

 

Find out more:

Scientific data must be managed publicly (in French): http://owni.fr/2010/09/01/il-faut-gerer-publiquement-les-donnees-scientifiques/

Open data in Science: http://precedings.nature.com/documents/1526/version/1

Michael Nielsen: Open science now! http://www.ted.com/talks/michael_nielsen_open_science_now.html