Using machine learning to disentangle homonyms in large text corpora.

  • Roll, Uri (1, 2)
  • Correia, Ricardo A. (2, 3, 4)
  • Berger-Tal, Oded (1)
  • 1 Mitrani Department of Desert Ecology, The Jacob Blaustein Institutes for Desert Research, Ben-Gurion University of the Negev, Midreshet Ben-Gurion 8499000, Israel.
  • 2 School of Geography and the Environment, University of Oxford, Oxford, OX1 3QY, UK.
  • 3 Institute of Biological Sciences and Health, Federal University of Alagoas, Campus A. C. Simões, Av. Lourival Melo Mota, s/n Tabuleiro dos Martins, Maceió, AL, Brazil.
  • 4 DBIO & CESAM-Centre for Environmental and Marine Studies, University of Aveiro, Aveiro, Portugal.
Published Article
Conservation Biology
Wiley (Blackwell Publishing)
Publication Date
Oct 31, 2017
DOI: 10.1111/cobi.13044
PMID: 29086438


Systematic reviews are an increasingly popular decision-making tool that provides an unbiased summary of evidence to support conservation action. These reviews bridge the gap between researchers and managers by presenting a comprehensive overview of all studies relating to a particular topic and by identifying specifically where and under which conditions an effect is present. However, several technical challenges can severely hinder the feasibility and applicability of systematic reviews. One such challenge is the presence of homonyms: terms that share spelling but differ in meaning. Homonyms add considerable noise to search results, but they cannot be easily identified and removed. In this work, we developed a semi-automated approach that can aid in the classification of homonyms between narratives. We used a combination of automated content analysis and artificial neural networks to quickly and accurately sift through large corpora of academic texts and classify them into distinct topics. As an example, we explored the use of the word 'reintroduction' in academic texts. Reintroduction is used within the conservation context to indicate the release of organisms to their former native habitat; however, a 'Web of Science' search using this word returned thousands of publications that use this term with other meanings and contexts. Using our method, we were able to automatically classify a sample of 3000 of these publications with more than 99% accuracy when compared with a manual classification. Our approach can be easily used with any other homonym terms and can greatly facilitate systematic reviews, or any similar cases where homonyms hinder the harnessing of large text corpora. Beyond homonyms, we see great promise in the combination of automated content analysis and machine learning methods in handling and screening big data for relevant information in conservation science. This article is protected by copyright. All rights reserved.
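The general idea of classifying texts by the sense in which they use a homonym, then checking the result against manual labels, can be illustrated with a small sketch. The abstract does not describe the features or the network architecture used, so everything below is an assumption made for illustration: the toy corpus is hypothetical, the features are plain bag-of-words counts, and the model is a single logistic unit (about the smallest possible neural network) rather than the authors' actual method.

```python
import math
import re

# Hypothetical toy corpus (not data from the study).
# Label 1 = conservation sense of "reintroduction", label 0 = any other sense.
TRAIN = [
    ("reintroduction of captive bred wolves to their native habitat", 1),
    ("species reintroduction improved population survival in the reserve", 1),
    ("translocation and reintroduction of endangered birds to the wild", 1),
    ("reintroduction of the drug after withdrawal in patients", 0),
    ("reintroduction of gluten in the diet of children with allergy", 0),
    ("policy reintroduction of the tax after parliamentary debate", 0),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(texts):
    """Map every token seen in the texts to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocab):
    """Bag-of-words counts; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def score(vec, weights, bias):
    """Estimated probability that the text uses the conservation sense."""
    return sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias)


def train(samples, vocab, epochs=300, lr=0.1):
    """Fit a single logistic unit with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(vocab), 0.0
    data = [(vectorize(text, vocab), label) for text, label in samples]
    for _ in range(epochs):
        for vec, label in data:
            err = score(vec, weights, bias) - label
            weights = [w - lr * err * x for w, x in zip(weights, vec)]
            bias -= lr * err
    return weights, bias


def classify(text, vocab, weights, bias):
    """Return 1 for the conservation sense, 0 otherwise."""
    return 1 if score(vectorize(text, vocab), weights, bias) >= 0.5 else 0


def accuracy(samples, vocab, weights, bias):
    """Share of samples whose predicted label matches the manual label."""
    hits = sum(classify(t, vocab, weights, bias) == y for t, y in samples)
    return hits / len(samples)


VOCAB = build_vocab(text for text, _ in TRAIN)
WEIGHTS, BIAS = train(TRAIN, VOCAB)
```

Calling `classify("reintroduction of beavers to river habitat", VOCAB, WEIGHTS, BIAS)` would label a new text, and `accuracy` applied to a manually labelled held-out sample mirrors the kind of comparison against manual classification mentioned in the abstract. A realistic pipeline would need a far larger labelled sample and a proper train/test split.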
