2017
Authors
Sousa, R; Gama, J;
Publication
Proceedings of the Workshop on IoT Large Scale Learning from Data Streams co-located with the 2017 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2017), Skopje, Macedonia, September 18-22, 2017.
Abstract
A comparison between the co-training and self-training methods for single-target regression based on multiple learners is performed. Data streaming systems can produce significant amounts of unlabeled data, caused by the impossibility of label assignment, the high cost of labeling, or long labeling tasks. In supervised learning, this data is wasted. In order to take advantage of unlabeled data, semi-supervised approaches such as co-training and self-training have been created to benefit from the input information contained in unlabeled data. However, these approaches have been applied to classification and batch training scenarios. For these reasons, this paper presents a comparison between co-training and self-training methods for single-target regression in data streams. Rule learning is used in this context since this methodology makes it possible to explore the input information. The experimental evaluation consisted of a comparison between the standard scenario, where all unlabeled data is rejected, and scenarios where unlabeled data is used to improve the regression model. Results show evidence of better performance in terms of error reduction, especially when the stream contains a high proportion of unlabeled examples. Despite this, the improvements are not substantial.
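The self-training loop the abstract describes can be sketched for a streaming regressor. The code below is a minimal illustration, not the paper's rule-based method: it uses a sliding-window nearest-neighbour regressor, and treats an unlabeled example as "confident" when a labeled neighbour lies within a hypothetical `radius`; confident predictions are fed back as pseudo-labels, other unlabeled items are discarded.

```python
import math
from collections import deque

class WindowKNNRegressor:
    """Sliding-window k-nearest-neighbour regressor (illustrative base learner)."""
    def __init__(self, k=3, window=200):
        self.k = k
        self.buffer = deque(maxlen=window)  # stored (x, y) pairs

    def _dist(self, a, b):
        return math.dist(a, b)

    def predict(self, x):
        if not self.buffer:
            return 0.0
        nearest = sorted(self.buffer, key=lambda p: self._dist(p[0], x))[:self.k]
        return sum(y for _, y in nearest) / len(nearest)

    def learn(self, x, y):
        self.buffer.append((x, y))

def self_training(stream, model, radius=0.5):
    """Self-training over a stream of (x, y) pairs; y is None when unlabeled.
    An unlabeled example with a stored neighbour within `radius` is
    pseudo-labeled with the model's own prediction and learned from."""
    for x, y in stream:
        if y is not None:
            model.learn(x, y)                       # supervised update
        elif model.buffer and min(model._dist(px, x)
                                  for px, _ in model.buffer) < radius:
            model.learn(x, model.predict(x))        # pseudo-label feedback
```

Because the base learner is instance-based, storing a pseudo-labeled point changes future neighbourhoods, so the feedback is not a no-op (as it would be for a single SGD linear model).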
2017
Authors
Sousa, R; Gama, J;
Publication
Foundations of Intelligent Systems - 23rd International Symposium, ISMIS 2017, Warsaw, Poland, June 26-29, 2017, Proceedings
Abstract
In a single-target regression context, some important systems based on data streaming produce huge quantities of unlabeled data (without an output value), for which label assignment may be impossible, time-consuming or expensive. Semi-supervised methods, which include the co-training approach, were proposed to use the input information of the unlabeled examples to improve models and predictions. In the literature, co-training methods are essentially applied to classification and operate in batch mode. For these reasons, this work proposes an online co-training algorithm for single-target regression that improves the model with unlabeled data. This work is also a first step toward the development of online multi-target regressors that create models for multiple outputs simultaneously. The experimental framework compared the performance of this method when it rejects unlabeled data and when it uses unlabeled data with different parametrizations during training. The results suggest that the co-training regressor predicts better when a portion of unlabeled examples is used. However, the prediction improvements are relatively small. © Springer International Publishing AG 2017.
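Online co-training for regression can be sketched under the common two-view assumption. This is an illustrative sketch, not the paper's algorithm: two tiny SGD linear learners, each restricted to its own feature view, train jointly on labeled examples and exchange pseudo-labels on unlabeled ones; `use_unlabeled` is a hypothetical parameter controlling what fraction of unlabeled examples is consumed.

```python
import random

class ViewRegressor:
    """Tiny SGD linear regressor restricted to a feature view (illustrative)."""
    def __init__(self, view, lr=0.05):
        self.view = view               # indices of the features this learner sees
        self.w = [0.0] * len(view)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        xs = [x[i] for i in self.view]
        return sum(w * v for w, v in zip(self.w, xs)) + self.b

    def learn(self, x, y):
        xs = [x[i] for i in self.view]
        err = self.predict(x) - y      # squared-loss gradient step
        for j, v in enumerate(xs):
            self.w[j] -= self.lr * err * v
        self.b -= self.lr * err

def co_training(stream, m1, m2, use_unlabeled=0.5, rng=None):
    """Online co-training: labeled (x, y) pairs train both learners;
    for a fraction `use_unlabeled` of unlabeled items (y is None),
    each learner is trained on the other's prediction."""
    rng = rng or random.Random(0)
    for x, y in stream:
        if y is not None:
            m1.learn(x, y)
            m2.learn(x, y)
        elif rng.random() < use_unlabeled:
            p1, p2 = m1.predict(x), m2.predict(x)
            m1.learn(x, p2)            # learners teach each other
            m2.learn(x, p1)
```

The point of the two views is that each learner's pseudo-label carries information the other learner cannot see in its own features.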
2017
Authors
Jorge, AM; Vinagre, J; Domingues, M; Gama, J; Soares, C; Matuszyk, P; Spiliopoulou, M;
Publication
E-COMMERCE AND WEB TECHNOLOGIES, EC-WEB 2016
Abstract
Given the large volumes and dynamics of data that recommender systems currently have to deal with, we look at online, stream-based approaches that are able to cope with high-throughput observations. In this paper we describe work on incremental neighborhood-based and incremental matrix factorization approaches for binary ratings, starting with a general introduction, looking at various approaches and describing existing enhancements. We refer to recent work on forgetting techniques and multidimensional recommendation. We also focus on adequate procedures for the evaluation of online recommender algorithms.
2017
Authors
Vinagre, J; Jorge, AM; Gama, J;
Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2017)
Abstract
Online recommender systems often deal with continuous, potentially fast and unbounded flows of data. Ensemble methods for recommender systems have been used in the past in batch algorithms; however, they have never been studied with incremental algorithms that learn from data streams. We evaluate online bagging with an incremental matrix factorization algorithm for top-N recommendation with positive-only user feedback, often known as binary ratings. Our results show that online bagging is able to improve accuracy up to 35% over the baseline, with small computational overhead.
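Online bagging (in the style of Oza and Russell) approximates bootstrap sampling on a stream: each base learner sees each incoming example k times, with k drawn from Poisson(1). A minimal sketch follows; the `RunningMean` base learner is only a stand-in for the incremental matrix factorization model used in the paper.

```python
import math
import random

def poisson1(rng):
    """Sample k ~ Poisson(lambda=1) by Knuth's multiplication method."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

class RunningMean:
    """Trivial base learner: predicts the running mean of observed targets."""
    def __init__(self):
        self.n, self.s = 0, 0.0
    def learn(self, x, y):
        self.n += 1
        self.s += y
    def predict(self, x):
        return self.s / self.n if self.n else 0.0

class OnlineBagging:
    """Oza-style online bagging over any incremental learner:
    each model trains on each example Poisson(1) times."""
    def __init__(self, make_learner, n_models=10, seed=0):
        self.models = [make_learner() for _ in range(n_models)]
        self.rng = random.Random(seed)

    def learn(self, x, y):
        for m in self.models:
            for _ in range(poisson1(self.rng)):
                m.learn(x, y)

    def predict(self, x):
        preds = [m.predict(x) for m in self.models]
        return sum(preds) / len(preds)
```

Since E[Poisson(1)] = 1, each model consumes on average one copy of each example, but the per-model variation emulates distinct bootstrap samples without storing the stream.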
2017
Authors
Nogueira, DM; Ferreira, CA; Jorge, AM;
Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2017)
Abstract
Phonocardiogram signals contain very useful information about the condition of the heart. A phonocardiogram is a recording of heart sounds that can be visually represented on a chart. By analyzing these signals, heart diseases can be detected and diagnosed early. Intelligent and automated analysis of the phonocardiogram is therefore very important in determining whether the patient's heart works properly or whether the patient should be referred to an expert for further evaluation. In this work, we use electrocardiograms and phonocardiograms collected simultaneously from the PhysioNet challenge database, and we aim to determine whether a phonocardiogram corresponds to a "normal" or "abnormal" physiological state. The main idea is to translate a 1D phonocardiogram signal into a 2D image that represents temporal and Mel-frequency cepstral coefficient features. To do that, we develop a novel approach that uses both kinds of features. First, we segment the phonocardiogram signals with an algorithm based on a logistic regression hidden semi-Markov model, which uses the electrocardiogram signals as a reference. After that, we extract a group of features from the time and frequency domains (Mel-frequency cepstral coefficients) of the phonocardiogram. Then, we combine these features into a two-dimensional time-frequency heat map representation. Lastly, we run a binary classifier to learn a model that discriminates between normal and abnormal phonocardiogram signals. In the experiments, we study the contribution of temporal and Mel-frequency cepstral coefficient features and evaluate three classification algorithms: Support Vector Machines, Convolutional Neural Networks, and Random Forests. The best results are achieved when we map both temporal and Mel-frequency cepstral coefficient features into a 2D image and use Support Vector Machines with a radial basis function kernel.
Indeed, by including both temporal and Mel-frequency cepstral coefficient features, we obtain slightly better results than the ones reported by the challenge participants, who used large amounts of data and high computational power.
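The feature-stacking step can be illustrated with a small numpy sketch: each heart-cycle segment becomes one column of an image-like 2D array, with temporal features (duration, RMS energy) in the first rows and cepstral-style rows below. The "cepstral" rows here are a crude log-spectrum cepstrum stand-in, not true mel-filterbank MFCCs, and all names are illustrative.

```python
import numpy as np

def feature_map(segments, n_ceps=13, n_fft=64):
    """Stack per-segment features into a 2D time-frequency map.
    Columns = segments; rows = [duration, RMS, crude cepstral coefficients].
    A real pipeline would compute MFCCs via a mel filterbank + DCT."""
    cols = []
    for seg in segments:
        seg = np.asarray(seg, dtype=float)
        duration = float(len(seg))                       # temporal feature
        rms = float(np.sqrt(np.mean(seg ** 2)))          # temporal feature
        spectrum = np.abs(np.fft.rfft(seg, n=n_fft))
        ceps = np.fft.irfft(np.log(spectrum + 1e-9))[:n_ceps]  # crude cepstrum
        cols.append(np.concatenate(([duration, rms], ceps)))
    return np.stack(cols, axis=1)   # shape: (2 + n_ceps, n_segments)
```

The resulting array can be fed directly to a classifier (flattened for an SVM, or as a single-channel image for a CNN), which mirrors the heat-map representation the abstract describes.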
2017
Authors
Campos, R; Dias, G; Jorge, AM; Nunes, C;
Publication
INFORMATION RETRIEVAL JOURNAL
Abstract
Despite clear improvements in temporal search and retrieval applications, current search engines are still mostly unaware of the temporal dimension. Indeed, in most cases, systems are limited to offering the user the chance to restrict the search to a particular time period or to simply rely on an explicitly specified time span. If the user is not explicit in his/her search intent (e.g., "philip seymour hoffman"), search engines will likely fail to present an overall historical perspective of the topic. In most such cases, they are limited to retrieving the most recent results. One possible solution to this shortcoming is to understand the different time periods of the query. In this context, most state-of-the-art methodologies consider any occurrence of temporal expressions in web documents and other web data as equally relevant to an implicit time-sensitive query. To approach this problem in a more adequate manner, we propose in this paper the detection of temporal expressions relevant to the query. Unlike previous metadata- and query-log-based approaches, we show how to achieve this goal based on information extracted from document content. However, instead of simply focusing on the detection of the most obvious date, we are also interested in retrieving the set of dates that are relevant to the query. Toward this goal, we define a general similarity measure that makes use of co-occurrences of words and years based on corpus statistics, and a classification methodology that is able to identify the set of top relevant dates for a given implicit time-sensitive query, while filtering out the non-relevant ones. Through extensive experimental evaluation, we demonstrate that our approach offers promising results in the field of temporal information retrieval (T-IR), as shown by experiments conducted over several baselines on web corpora collections.
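The idea of scoring years by their co-occurrence with query terms can be sketched with a toy measure. This is not the paper's similarity measure: the score below is a crude conditional frequency, namely the fraction of a year's document occurrences that also contain a query term, with a hypothetical threshold used to keep only the top relevant dates.

```python
from collections import Counter

def year_relevance(docs, query_terms):
    """Toy co-occurrence score per year: of the documents mentioning a
    year, what fraction also contains a query term? A corpus-statistics
    stand-in for a real word-year similarity measure."""
    joint, year_total = Counter(), Counter()
    terms = {t.lower() for t in query_terms}
    for text in docs:
        tokens = text.lower().split()
        years = {t for t in tokens if t.isdigit() and 1000 <= int(t) <= 2100}
        has_term = bool(terms & set(tokens))
        for y in years:
            year_total[y] += 1
            if has_term:
                joint[y] += 1
    return {y: joint[y] / year_total[y] for y in year_total}

def top_dates(scores, threshold=0.5):
    """Keep years whose score clears the threshold, best first."""
    return sorted((y for y, s in scores.items() if s >= threshold),
                  key=lambda y: -scores[y])
```

The thresholding step plays the role of the classification stage described in the abstract: it filters out years that merely co-occur incidentally with the query terms.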