Publications - INESC TEC

Publications

Publications by João Gama

2021

Current Trends in Learning from Data Streams

Authors
Gama, J; Veloso, B; Aminian, E; Ribeiro, RP

Publication
9TH INTERNATIONAL CONFERENCE ON BIG DATA ANALYTICS, BDA 2021

Abstract
This article presents our recent work on the topic of learning from data streams. We focus on emerging topics, including fraud detection, learning from rare cases, and hyper-parameter tuning for streaming data.


2021

Hyper-parameter Optimization for Latent Spaces

Authors
Veloso, B; Caroprese, L; König, M; Teixeira, S; Manco, G; Hoos, HH; Gama, J

Publication
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2021: RESEARCH TRACK, PT III

Abstract
We present an online optimization method for time-evolving data streams that automatically adapts the hyper-parameters of an embedding model. More specifically, we employ the Nelder-Mead algorithm, which uses a set of heuristics to produce and exploit several potentially good configurations, from which the best one is selected and deployed. This step is repeated whenever the distribution of the data changes. We evaluate our approach on streams of real-world as well as synthetic data, where the latter is generated in such a way that its characteristics change over time (concept drift). Overall, we achieve good performance in terms of accuracy compared to state-of-the-art AutoML techniques.
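The Nelder-Mead search mentioned in this abstract can be illustrated with a minimal sketch. This is a basic textbook variant (reflection, expansion, inside contraction, shrink), not the authors' implementation. The function name `nelder_mead`, the toy loss, and the two hyper-parameter names are assumptions made for illustration only.

```python
def nelder_mead(f, simplex, iterations=200, alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5):
    """Minimise f starting from n+1 points, each a list of n floats."""
    simplex = [list(p) for p in simplex]
    n = len(simplex[0])
    for _ in range(iterations):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Centroid of every point except the worst one.
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        # Reflect the worst point through the centroid.
        reflected = [c + alpha * (c - w) for c, w in zip(centroid, worst)]
        if f(best) <= f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        elif f(reflected) < f(best):
            # The reflected point is the new best, so try going further.
            expanded = [c + gamma * (r - c) for c, r in zip(centroid, reflected)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        else:
            # Contract the worst point towards the centroid.
            contracted = [c + rho * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink all points towards the current best.
                simplex = [best] + [
                    [b + sigma * (x - b) for b, x in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=f)


def toy_loss(config):
    """Hypothetical stand-in for validation error on a window of the stream."""
    learning_rate, regularization = config
    return (learning_rate - 0.3) ** 2 + (regularization - 0.7) ** 2


# Three starting configurations form the initial simplex in two dimensions.
best_config = nelder_mead(toy_loss, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In a streaming setting, one plausible design is to rerun this search on a recent window of data each time a drift detector fires, which matches the abstract's statement that the step is repeated when the data distribution changes.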


2021

Chebyshev approaches for imbalanced data streams regression models

Authors
Aminian, E; Ribeiro, RP; Gama, J

Publication
DATA MINING AND KNOWLEDGE DISCOVERY

Abstract
In recent years, data stream mining and learning from imbalanced data have been active research areas. Even though solutions exist to tackle these two problems, most of them are not designed to handle the challenges inherited from both. As far as we are aware, the few approaches in the area of learning from imbalanced data streams fall in the context of classification, and no efforts in the regression domain have been reported yet. This paper proposes a technique that uses sampling strategies to cope with imbalanced data streams in a regression setting, where the most important cases have rare and extreme target values. Specifically, we employ under-sampling and over-sampling strategies that use the value given by Chebyshev's inequality as a heuristic to determine the type of incoming cases (i.e., frequent or rare). We have evaluated our proposal by applying it to the training of models with four well-known regression algorithms over fourteen benchmark data sets. We conducted a series of experiments with different setups on both synthetic and real-world data sets. The experimental results confirm our approach's effectiveness by showing that the models trained with each of the sampling strategies outperform their baseline counterparts.
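A minimal sketch of how Chebyshev's inequality can flag rare target values in a stream follows. The inequality states that P(|Y - mean| >= t * std) <= 1 / t^2 for t > 0, so a small bound suggests an extreme case. The class `RunningStats` and the rule `keep_probability` are hypothetical illustrations and are not the sampling procedure defined in the paper.

```python
import math


class RunningStats:
    """Welford's online mean and sample variance for a stream of targets."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0

    def update(self, y):
        self.n += 1
        delta = y - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (y - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


def chebyshev_bound(y, mean, std):
    """Upper bound on P(|Y - mean| >= |y - mean|) from Chebyshev's inequality."""
    if std == 0.0:
        return 1.0
    t = abs(y - mean) / std
    # For t <= 1 the inequality is uninformative, so the bound is capped at 1.
    return 1.0 if t <= 1.0 else 1.0 / (t * t)


def keep_probability(y, mean, std):
    """Assumed under-sampling rule: rarer targets are kept more often."""
    return 1.0 - chebyshev_bound(y, mean, std)


stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
    stats.update(value)
```

An over-sampling counterpart could, under the same assumption, replicate a case more times as its bound gets smaller.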


2021

Dynamic Topic Modeling Using Social Network Analytics

Authors
Tabassum, S; Gama, J; Azevedo, P; Teixeira, L; Martins, C; Martins, A

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2021)

Abstract
Topic modeling, or topic inference, is one of the well-known problems in text mining. It deals with the automatic categorisation of words or documents into similarity groups, also known as topics. On most social media platforms, such as Twitter, Instagram, and Facebook, hashtags are used to define the content of posts. Therefore, modelling hashtags helps in categorising posts as well as analysing user preferences. In this work, we address this problem for hashtags that stream in real time. Our approach combines a graph of hashtags, dynamic sampling, and modularity-based community detection over data from a popular social media engagement application. Further, we analysed the structure and quality of the topic clusters using empirical experiments. The results unveil latent semantic relations between hashtags and also show the frequent hashtags in a cluster. Moreover, in this approach, words in different languages are treated synonymously. We also observed top trending topics and correlated clusters.


2021

Spatiotemporal Road Traffic Anomaly Detection: A Tensor-Based Approach

Authors
Tisljaric, L; Fernandes, S; Caric, T; Gama, J

Publication
APPLIED SCIENCES-BASEL

Abstract
The increased development of urban areas results in a larger number of vehicles on the road network, leading to traffic congestion, which often leads to potentially dangerous situations that can be described as anomalies. Tensor-based methods have emerged only recently in applications related to traffic anomaly detection. They outperform other models at simultaneously capturing spatial and temporal components, which are of immense importance in traffic dataset analysis. This paper presents a tensor-based method for extracting spatiotemporal road traffic patterns represented with speed transition matrices, with the goal of anomaly detection. A novel anomaly detection approach is presented, which relies on computing the center of mass of the observed traffic patterns. The method was evaluated on a large road traffic dataset and was able to detect the most anomalous parts of the urban road network. By analyzing the spatial and temporal components of the most anomalous traffic patterns, sources of anomalies can be identified. Results were validated using domain knowledge extracted from the Highway Capacity Manual. The anomaly detection model achieved a precision score of 92.88%. Therefore, this method is useful to safety experts for detecting potentially dangerous road segments, as well as to urban traffic planners and routing applications.
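The center-of-mass idea mentioned in this abstract can be sketched as follows. The sketch assumes a speed transition matrix is a grid of non-negative counts in which entry (i, j) counts transitions from origin speed bin i to destination speed bin j. That layout, the function name `center_of_mass`, and the example matrices are assumptions for illustration, not details taken from the paper.

```python
def center_of_mass(matrix):
    """Return the (row, column) center of mass of a grid of non-negative weights."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix has no mass")
    # Weighted average of the row index and of the column index.
    row_center = sum(i * sum(row) for i, row in enumerate(matrix)) / total
    col_center = sum(
        j * value for row in matrix for j, value in enumerate(row)
    ) / total
    return row_center, col_center


# Hypothetical 3x3 matrices, with speed bins ordered from slow (0) to fast (2).
free_flow = [
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 8],
]
congested = [
    [6, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

free_flow_center = center_of_mass(free_flow)
congested_center = center_of_mass(congested)
```

Under these assumptions, a center of mass near the low-speed corner of the matrix indicates that most transitions happen at low speeds, which is one way a pattern could be scored as more anomalous.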


2022

Host-based IDS: A review and open issues of an anomaly detection system in IoT

Authors
Martins, I; Resende, JS; Sousa, PR; Silva, S; Antunes, L; Gama, J

Publication
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE

Abstract
The Internet of Things (IoT) envisions a smart environment powered by connectivity and heterogeneity where ensuring reliable services and communications across multiple industries, from financial fields to healthcare and fault detection systems, is a top priority. In such fields, data is collected and broadcast continuously, at high speed and in real time, which places IoT within the stream processing paradigm. Intrusion Detection Systems (IDS) rely on manually defined security policies and signatures, which fail to provide a real-time solution or to prevent zero-day attacks. Therefore, anomaly detection appears as a prominent solution capable of recognizing patterns, learning from experience, and detecting abnormal behavior. However, most approaches do not meet these requirements and are often evaluated on deprecated datasets that are not representative of the working environment. Our contributions are therefore an overview of cybersecurity threats in IoT, important recommendations for a real-time IDS, and a real-time dataset setting to evaluate a security system covering multiple cyber threats. The dataset used to evaluate current host-based IDS approaches is publicly available and can be used as a benchmark by the community.
