BOUTEF: Bolstering Our Understanding Through an Elaborated Fake News Corpus
- Publication Date
- Apr 19, 2024
- Source
- Hal-Diderot
- Language
- English
- License
- Unknown
Abstract
This article presents BOUTEF, an original and comprehensive corpus of fake news. It encompasses content in Algerian and Tunisian dialects, Modern Standard Arabic (MSA), French, and English, featuring instances of code-switching between these languages. Moreover, for the Algerian and Tunisian dialects, we have preserved both Latin and Arabic scripts in the dataset. BOUTEF comprises over 3,600 fake news posts collected from various social media platforms spanning from 2010 to 2024. This corpus is developed as part of the TRADEF project and is made available to the research community. Each fake news post in BOUTEF is associated with 16 attributes, providing rich contextual information. The data was gathered from Facebook, Twitter, YouTube, and TikTok, reflecting the diverse sources of misinformation. To enhance the depth of our analysis, we introduce a novel labeling scheme consisting of 40 categories. This scheme is developed through a thorough examination of the collected corpus, and we have also retained a tagging process inspired by Claire Wardle’s categorization. BOUTEF not only contributes to the understanding of fake news in multilingual contexts but also provides valuable resources for further research in this domain.