2018
Authors
Guimarães, N; Figueira, A; Torgo, L;
Publication
Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2018, Volume 1: KDIR, Seville, Spain, September 18-20, 2018.
Abstract
Misinformation propagation on social media has been growing significantly, reaching major exposure during the 2016 United States Presidential Election. Since then, the scientific community and major tech companies have been working on the problem to curb the propagation of misinformation. To this end, research has focused on three major sub-fields: the identification of fake news through the analysis of unreliable posts, the propagation patterns of posts in social media, and the detection of bots and spammers. However, few works have tried to identify the characteristics of a post that shares unreliable content and the associated behaviour of its account. This work presents four main contributions to this problem. First, we provide a methodology to build a large knowledge database of tweets that disseminate misinformation links. Then, we answer research questions on the data with the goal of bridging this problem to similar problems explored in the literature. Next, we focus on accounts that constantly propagate misinformation links. Finally, based on the analysis conducted, we develop a model to detect social media accounts that spread unreliable content. Using Decision Trees, we achieved an F1-score of 96%, which attests to the reliability of our approach. Copyright 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.
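To make the modelling step concrete, below is a minimal sketch, in Python with scikit-learn, of training a decision tree to flag accounts that spread unreliable content and evaluating it with the F1-score. The account-level features and the synthetic data are hypothetical stand-ins, not the features or data used in the paper.

# A minimal sketch (not the authors' pipeline): a decision tree classifier for
# accounts that spread unreliable content, evaluated with the F1-score.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 2000
# Hypothetical account-level features: posting rate, fraction of tweets with
# links, fraction of links to flagged domains, account age in days.
X = np.column_stack([
    rng.exponential(5.0, n),       # tweets per day
    rng.uniform(0.0, 1.0, n),      # fraction of tweets containing links
    rng.beta(1.0, 5.0, n),         # fraction of links to flagged domains
    rng.uniform(1.0, 3000.0, n),   # account age (days)
])
# Synthetic label: accounts with many links to flagged domains are "unreliable".
y = (X[:, 2] > 0.3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
print("F1 on held-out accounts:", round(f1_score(y_te, clf.predict(X_te)), 3))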
2018
Authors
Figueira, A; Guimarães, N; Torgo, L;
Publication
Proceedings of the 14th International Conference on Web Information Systems and Technologies, WEBIST 2018, Seville, Spain, September 18-20, 2018.
Abstract
Nowadays, false news can be created and disseminated easily through the many social media platforms, resulting in widespread real-world impact. Modeling and characterizing how false information proliferates on social platforms and why it succeeds in deceiving readers is critical to developing efficient algorithms and tools for its early detection. A recent surge of research in this area has aimed to address the key issues using methods based on machine learning, deep learning, feature engineering, graph mining, and image and video analysis, together with newly created data sets and web services to identify deceiving content. The majority of the research has targeted fake reviews, biased messages, and information that contradicts known facts (false news and hoaxes). In this work, we present a survey of the state of the art concerning types of fake news and the solutions that are being proposed. We focus our survey on content analysis, network propagation, fact-checking, fake news analysis, and emerging detection systems. We also discuss the rationale behind successfully deceiving readers. Finally, we highlight important challenges that these solutions bring.
2018
Authors
Branco, P; Torgo, L; Ribeiro, RP;
Publication
EXPERT SYSTEMS
Abstract
Imbalanced domains are an important problem arising in predictive tasks, causing a loss in performance on the cases that are most relevant to the user. This problem has been extensively studied for classification problems, where the target variable is nominal. Recently, it was recognized that imbalanced domains occur in several other contexts and for multiple tasks, such as regression tasks, where the target variable is continuous. This paper focuses on imbalanced domains in both classification and regression tasks. Resampling strategies are among the most successful approaches to address imbalanced domains. In this work, we propose variants of existing resampling strategies that are able to take into account information regarding the neighbourhood of the examples. Instead of sampling uniformly, our proposals bias the strategies to reinforce some regions of the data sets. With an extensive set of experiments, we provide evidence of the advantage of introducing a neighbourhood bias in the resampling strategies for both classification and regression tasks with imbalanced data sets.
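As an illustration of the general idea, the sketch below (Python, scikit-learn) oversamples minority-class examples with a probability that depends on the class composition of their k nearest neighbours rather than uniformly. The weighting rule and the function name neighbourhood_biased_oversample are illustrative assumptions and do not reproduce the paper's exact resampling variants.

# A minimal sketch of neighbourhood-biased oversampling for binary classification:
# minority examples are replicated with a probability driven by the class mix of
# their k nearest neighbours, instead of uniform random oversampling.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbourhood_biased_oversample(X, y, minority_label=1, k=5, random_state=0):
    rng = np.random.default_rng(random_state)
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)                 # first neighbour is the point itself
    neigh_labels = y[idx[:, 1:]]                  # drop the self-neighbour
    # Weight: fraction of opposite-class neighbours (focus on borderline regions).
    w = (neigh_labels != minority_label).mean(axis=1) + 1e-6
    w = w / w.sum()
    n_new = (y != minority_label).sum() - (y == minority_label).sum()
    picks = rng.choice(len(X_min), size=max(n_new, 0), p=w)
    X_out = np.vstack([X, X_min[picks]])
    y_out = np.concatenate([y, np.full(len(picks), minority_label)])
    return X_out, y_out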
2018
Authors
Moniz, N; Torgo, L;
Publication
Proceedings of the 29th ACM Conference on Hypertext and Social Media, HT 2018, Baltimore, MD, USA, July 09-12, 2018
Abstract
The ability to generate and share content on social media platforms has changed the Internet. With the growing rate of content generation, efforts have been directed at making sense of such data. One of the most researched problems concerns predicting web content popularity. We argue that the evolution of state-of-the-art approaches has been optimized towards improving the predictability of the average behaviour of the data: items with low levels of popularity. We demonstrate this effect using a utility-based framework for evaluating numerical web content popularity prediction tasks, focusing on highly popular items. Additionally, we demonstrate that gains in predictive and ranking ability for such cases can be obtained via naïve approaches based on strategies for tackling imbalanced-domain learning tasks. © 2018 Association for Computing Machinery.
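The following is a minimal sketch of the kind of relevance-weighted evaluation such a utility-based framework motivates: errors on rare, highly popular items weigh more than errors on the many low-popularity items. The sigmoid relevance function and the threshold parameter are illustrative assumptions, not the paper's exact formulation.

# A minimal sketch: a relevance-weighted error metric that emphasises highly
# popular items over the dominant low-popularity cases.
import numpy as np

def relevance(y, threshold, steepness=1.0):
    # Smoothly maps popularity values to [0, 1], approaching 1 above `threshold`.
    return 1.0 / (1.0 + np.exp(-steepness * (y - threshold)))

def relevance_weighted_mae(y_true, y_pred, threshold):
    phi = relevance(y_true, threshold)
    return np.sum(phi * np.abs(y_true - y_pred)) / np.sum(phi)

y_true = np.array([2, 3, 1, 250, 4, 600])    # e.g., share counts
y_pred = np.array([3, 2, 2, 40, 5, 100])     # a model that underestimates popular items
print(relevance_weighted_mae(y_true, y_pred, threshold=100))  # dominated by popular items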
2018
Authors
Branco, P; Torgo, L; Ribeiro, RP;
Publication
Discovery Science - 21st International Conference, DS 2018, Limassol, Cyprus, October 29-31, 2018, Proceedings
Abstract
Several important real-world problems of predictive analytics involve handling different costs of the predictions of the learned models. The research community has developed multiple techniques to deal with these tasks. The utility-based learning framework is a generalization of cost-sensitive tasks that takes into account both the costs of errors and the benefits of accurate predictions. This framework has important advantages, such as allowing more complex settings to be represented, reflecting the domain knowledge in a more complete and precise way. Most existing work addresses classification tasks, with only a few proposals tackling regression problems. In this paper, we propose a new method, MetaUtil, for solving utility-based regression problems. The MetaUtil algorithm is versatile, allowing the conversion of any out-of-the-box regression algorithm into a utility-based method. We show the advantage of our proposal in a large set of experiments on a diverse set of domains. © 2018, Springer Nature Switzerland AG.
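To illustrate the plug-in nature of such wrappers, the sketch below wraps any scikit-learn regressor and biases its training sample towards cases deemed relevant by a user-supplied relevance function. This is only an assumption-laden illustration of the "wrap any out-of-the-box regressor" idea; it is not the MetaUtil algorithm itself.

# A minimal sketch: a utility-aware wrapper that resamples the training set
# according to a relevance function before fitting any base regressor.
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.ensemble import RandomForestRegressor

class RelevanceResampledRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, base_estimator=None, relevance_fn=None, random_state=0):
        self.base_estimator = base_estimator
        self.relevance_fn = relevance_fn
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        phi = self.relevance_fn(y) + 1e-6            # importance of each training case
        p = phi / phi.sum()
        idx = rng.choice(len(y), size=len(y), p=p)   # resample biased towards relevant cases
        est = clone(self.base_estimator or RandomForestRegressor(random_state=0))
        self.model_ = est.fit(X[idx], y[idx])
        return self

    def predict(self, X):
        return self.model_.predict(X)

# Example usage: emphasise extreme target values with a simple relevance function.
# reg = RelevanceResampledRegressor(relevance_fn=lambda y: np.abs(y - np.median(y)))
# reg.fit(X_train, y_train); y_hat = reg.predict(X_test)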
2019
Authors
Oliveira, M; Torgo, L; Costa, VS;
Publication
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2018, PT I
Abstract
The amount of available spatio-temporal data has been increasing as large-scale data collection (e.g., from geosensor networks) becomes more prevalent. This has led to an increase in spatio-temporal forecasting applications using geo-referenced time series data, motivated by important domains such as environmental monitoring (e.g., air pollution index, forest fire risk prediction). Being able to properly assess the performance of new forecasting approaches is fundamental to achieving progress. However, the dependence between observations implied by the spatio-temporal context, besides being challenging in the modelling step, also raises issues for performance estimation, as indicated by previous work. In this paper, we empirically compare several variants of cross-validation (CV) and out-of-sample (OOS) performance estimation procedures that respect data ordering, using both artificially generated and real-world spatio-temporal data sets. Our results show that both CV and OOS report useful estimates. Further, they suggest that blocking may be useful in addressing CV's bias to underestimate error. OOS can be very sensitive to test size, as expected, but estimates can be improved by careful management of the temporal dimension in training.
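As a minimal illustration of the difference between estimation procedures that respect data ordering and those that do not, the sketch below (Python, scikit-learn) contrasts standard shuffled cross-validation with a temporally blocked split on a synthetic geo-referenced time series. The data generation and model choice are assumptions for illustration, not the paper's experimental setup.

# A minimal sketch: shuffled CV vs. temporally blocked CV on synthetic
# geo-referenced time series data.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_stations, n_times = 20, 200
t = np.tile(np.arange(n_times), n_stations)              # time index per observation
station = np.repeat(np.arange(n_stations), n_times)      # station id per observation
X = np.column_stack([t, station, rng.normal(size=n_stations * n_times)])
y = 0.01 * t + np.sin(station) + rng.normal(scale=0.1, size=len(t))

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Standard shuffled CV ignores temporal dependence (optimistic for dependent data).
shuffled = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Blocked CV: contiguous time blocks stay together, so test data is never
# temporally interleaved with training data.
blocks = t // (n_times // 5)
blocked = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5), groups=blocks)

print("shuffled CV R^2:", shuffled.mean().round(3), " blocked CV R^2:", blocked.mean().round(3))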