The notion of the Semantic Web (commonly associated with the term “web 3.0”) is being used more and more. What does this association of the words “Semantic” and “Web” mean, each coming from a quite different discipline, namely linguistics and computing? Will the Web of the future be “intelligent”, capable of evaluating the relevance of an answer by analyzing the semantics of the corresponding question? Will it make it possible to exploit all the content of the web in a perfectly relevant and precise manner?
This article is a translation of “Le web sémantique : un projet pour amener le web à son plein potentiel” available at: http://blog.mysciencework.com/2012/06/25/web-semantique-projet-amener-le-web-plein-potentie.html It was translated from French into English by Mayte Perea López.
The Semantic Web is a project launched in 2001 by Tim Berners-Lee, the inventor of the World Wide Web. It was developed under the aegis of the W3C, the organization that standardizes the computer formats used on the Internet. From the outset, its ambition has been to develop a set of technologies for systematically describing and exploiting the semantics of web resources. Without questioning the technological foundations of the current web, it aims to extend and improve the structuring of its content.
Semantic Web and Ontologies - source: Samuel Huron/Flickr
Resources and their Semantics
To understand its basic concepts, let’s first try to define the notions of “resource” and “semantics”.
The web today contains a multitude of pages located at virtual addresses known as URLs (Uniform Resource Locators) and other objects identified by URNs (Uniform Resource Names). A resource, uniquely identified by such an address, can be a simple HTML page displayed in a browser, but also an image, a video, a paragraph of text, a Wikipedia definition, etc. Today’s web is limited because a computer that processes URLs cannot interpret the content of the associated resources and therefore cannot carry out more precise processing. For example, we would like a computer to be able to respond correctly and automatically to another computer issuing the following request: “Send me an image of a dog.” For this to happen, both programs need additional information about the images at their disposal. It is at this level that the notion of semantics comes in [1].
Semantics can be characterized by the triple “resource, agent, concept”, in which a resource is defined as above (e.g. an image), an agent can be a simple program running on a computer, and a concept is the term used by the Semantic Web to refer to the information associated with a resource that defines the category to which it belongs. Basically, then, semantics is a relationship of interpretation [2], or in simple terms: “What does Y mean for X?” One possible answer: “This image, for this given program, is a dog!” The principle is already used in computer science, notably with metadata and tags. Metadata are data embedded within a file that describe it and do not appear on screen when the file is viewed; they are the basis of archiving and cataloging. The aim of the Semantic Web is their systematic, structured and standardized use.
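As a rough illustration (not drawn from the article itself), this kind of “resource, concept” statement can be sketched with Python’s rdflib library; the namespace, resource URI and concept name below are invented for the example:

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

# Hypothetical namespace and resource, purely for illustration
EX = Namespace("http://example.org/concepts#")
image = URIRef("http://example.org/images/rex.jpg")

g = Graph()
g.bind("ex", EX)

# The metadata triple: "this image belongs to the category Dog"
g.add((image, RDF.type, EX.Dog))

print(g.serialize(format="turtle"))
```

The resource stays where it is on the web; what the agent exchanges and interprets is this small piece of structured metadata about it.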
This structuring of data comes into its own in the context of the web, where resources are fully distributed: the data needed are not always available locally, so a system must be able to explore and query remote sources automatically. For example, two software agents can share data remotely, but only if they follow the same description system and the same conventions for the semantic interpretation of the objects they use (they have to speak the same language). Hence the importance of standardization, and the role of the W3C (World Wide Web Consortium) and of standards bodies such as ISO (International Organization for Standardization) in ensuring this interoperability.
Ontologies
This description system, based on the notion of concept explained above, actually takes the form of a network of concepts known as an “ontology”, in reference to the branch of philosophy, going back to Aristotle, that investigates the nature of being. For the Semantic Web, however, in its technological dimension, an ontology is not a program of philosophical inquiry but a mathematical and computational object with a precise definition.
The concepts of an ontology are interrelated through different types of relationships, such as the specialization relationship (e.g. the category of dogs is a specialization of the category of mammals), or relationships specific to a field such as “X is in Y”, “X is the author of Y”, “X is the title of Y”, etc. Ontologies express relationships that generally hold between the categories of entities in a given field. Thus, if an ontology specifies that the concept for the category of dogs is a specialization of the concept for mammals, a system can easily exploit this information, for example by returning the image of a dog when the request submitted to a semantic search engine is: “Send me an image of a mammal.” In this way, such mechanisms equip the web with reasoning capabilities that improve the relevance of the answers to a search.
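A minimal sketch of this kind of inference, again with rdflib and invented identifiers: declaring that Dog is a subclass of Mammal lets a query for mammals retrieve an image that was only tagged as a dog.

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/concepts#")
image = URIRef("http://example.org/images/rex.jpg")

g = Graph()
g.add((image, RDF.type, EX.Dog))             # the image is a dog
g.add((EX.Dog, RDFS.subClassOf, EX.Mammal))  # dogs are mammals

# "Send me an image of a mammal": follow rdf:type, then any chain of subClassOf links
query = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/concepts#>
SELECT ?resource WHERE { ?resource rdf:type/rdfs:subClassOf* ex:Mammal . }
"""
for row in g.query(query):
    print(row.resource)   # -> http://example.org/images/rex.jpg
```

Nothing in the data says explicitly that the image shows a mammal; the answer follows from the subclass relationship declared in the ontology.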
Some simple ontologies already exist for specific fields: FOAF (Friend Of A Friend) describes the identity of people and the social links between them, and Dublin Core describes digital multimedia resources (title, author, format, etc.). Both consist of a standardized vocabulary that can be expressed in XML. The most significant ontology currently in use is the Gene Ontology. Stemming from a bioinformatics project, its aim is to standardize the representation of genes across all species so that databases can interoperate.
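For illustration, here is a hedged sketch of what FOAF and Dublin Core statements look like, using rdflib’s built-in namespaces; the people and documents named are fictitious:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDF, FOAF, DC

g = Graph()

# FOAF: a person and a social link (identifiers invented for the example)
alice = URIRef("http://example.org/people/alice")
g.add((alice, RDF.type, FOAF.Person))
g.add((alice, FOAF.name, Literal("Alice Martin")))
g.add((alice, FOAF.knows, URIRef("http://example.org/people/bob")))

# Dublin Core: descriptive metadata about a digital resource
report = URIRef("http://example.org/docs/report.pdf")
g.add((report, DC.title, Literal("Annual report")))
g.add((report, DC.creator, Literal("Alice Martin")))
g.add((report, DC.format, Literal("application/pdf")))

print(g.serialize(format="turtle"))
```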
Dissemination factors of the Semantic Web
One of the preconditions for implementing, on a large scale, websites that comply with the norms of the Semantic Web is knowing how to attach the right metadata (i.e. apply the right semantic categorization) to content. Given the amount of data available on the Internet, it is hard to imagine this operation without appropriate tools. At least two solutions can be considered: 1) adding functions to the tools used to create and publish web content, so that producers (for example, writers) can decide for themselves which semantic concepts best describe their production; 2) using automatic content-analysis tools to carry out the semantic categorization afterwards. Such tools exist for speech analysis (going from spoken to written language), image analysis (for example, identifying a photo as the representation of a dog) and, since most web content is text, automatic text analysis (using lexicon and syntax to make sentences “comprehensible”). As for the automatic processing of requests expressed in natural language, the linguistic analysis performed by search engines remains rather crude: they are based on keywords and do not take into account phenomena such as homonymy, synonymy, set expressions, etc. All these fields of research and development contribute to the dissemination of this technology and will continue to do so.
From an IT perspective, and more concretely, an ontology is a file (typically XML) associated with the content it aims to describe. The W3C recommends a series of languages of varying expressiveness: RDF (Resource Description Framework), OWL (Web Ontology Language), etc. In terms of tools, a number of ontology editors make it possible to manipulate RDF/OWL code, such as Protégé or Altova SemanticWorks, as do APIs (Application Programming Interfaces) such as Jena or the OWL API. For queries, SPARQL is an RDF query language whose underlying data model is a graph. In addition, tools like FaCT++ or Pellet are automated reasoners able to check the consistency of a set of assertions or draw certain types of inferences. Industrial actors in France involved in this technological environment include, for example, Exalead (search engine) and Arisem, Temis and Mondeca (automatic monitoring, electronic document management and natural language processing applications). In the United States, Google and Oracle are also involved.
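To make concrete the point that an ontology is, in practice, a file, here is a small sketch (still with rdflib, which plays a role analogous to the Jena API mentioned above; the file name and namespace are invented) that writes a tiny ontology to an RDF/XML file and reads it back:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/concepts#")

# A tiny ontology: a single subclass axiom
onto = Graph()
onto.add((EX.Dog, RDFS.subClassOf, EX.Mammal))

# Write it out as RDF/XML -- the kind of file an editor such as Protégé opens
onto.serialize(destination="zoo.rdf", format="xml")

# Any other agent can load the same file and reuse the ontology
loaded = Graph()
loaded.parse("zoo.rdf", format="xml")
print(len(loaded))  # -> 1 triple
```

The same graph could equally be serialized in Turtle or another RDF syntax; RDF/XML is simply the format that makes the “ontology as an XML file” description most literal.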
The Semantic Web will reach its full potential only if its standards are overwhelmingly adopted. For this to happen, several preconditions must be met:
- Making a standardization effort
- Facilitating the description and allocation of semantic descriptions to resources
- Ensuring compatibility with existing resources.
To go further into the technical aspects, see, for example, the W3C website, or read Pascal Hitzler, Markus Krötzsch, and Sebastian Rudolph, Foundations of Semantic Web Technologies, 1st ed. (Chapman and Hall/CRC, 2009).
To conclude, the Semantic Web aims to facilitate the structured processing of web data and its exploitation by machines. Considered by its creator, in 2001, to be the logical evolution of the web, it still requires a major standardization effort on a global scale. It is an ambitious project in which significant progress has already been made, but there is still room for improvement. The hope remains that someday we will have a web that meets our needs in a relevant manner, while optimizing the management of the content made available online.
Sources: T. Berners-Lee, J. Hendler, and O. Lassila, “The Semantic Web”, Scientific American 284, no. 5 (2001): 34-43. Franz Baader et al. (eds.), The Description Logic Handbook: Theory, Implementation, and Applications (Cambridge University Press, 2003). http://www.w3.org/TR/2009/REC-owl2-primer-20091027/
Notes:
[1] http://www.semantique-gdr.net
[2] This question of meaning has its equivalent in the philosophy of mind, in the work of the American philosopher John Searle, who rejected the idea that the human mind is modeled on the structure of the computer. According to him, a machine, unlike a human being, is not able to interpret or understand the symbols it manipulates. Even though it remains intrinsically impossible for a machine to reach a human being’s level of interpretation and comprehension, the Semantic Web project aims to reduce the gap between them. As for the way the human mind constructs meaning, that is still being investigated by the cognitive sciences.
Find out more:
Serge Abiteboul’s lecture at Collège de France (in French) http://www.college-de-france.fr/site/serge-abiteboul/le-web-semantique.htm
Semantic web and social web: new research practices and the circulation of knowledge on web 3.0 (in French) http://www.centre-dalembert.u-psud.fr/index.php?option=com_content&view=article&id=319