Publications

Publications by Alípio Jorge

2012

Forgetting mechanisms for scalable collaborative filtering

Authors
Vinagre, J; Jorge, AM;

Publication
Journal of the Brazilian Computer Society

Abstract
Collaborative filtering (CF) has been an important subject of research in the past few years. Many achievements have been made in this field; however, many challenges still need to be faced, mainly related to scalability and predictive ability. One important issue is how to deal with old and potentially obsolete data in order to avoid unnecessary memory usage and processing time. Our proposal is to use forgetting mechanisms. In this paper, we present and evaluate the impact of two forgetting mechanisms (sliding windows and fading factors) in user-based and item-based CF algorithms with implicit binary ratings under a scenario of abrupt change. Our results suggest that forgetting mechanisms reduce time and space requirements, improving scalability, while not significantly affecting the predictive ability of the algorithms. © 2012 The Brazilian Computer Society.
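
The two forgetting mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual update rules, so everything below is an assumption: the class names `SlidingWindow` and `FadingCounts`, the parameters `size` and `alpha`, and the choice to decay all stored weights on every new event are hypothetical, picked only to show the general idea of dropping or down-weighting old implicit (user, item) events.

```python
from collections import deque


class SlidingWindow:
    """Keep only the most recent `size` (user, item) events (hypothetical sketch)."""

    def __init__(self, size):
        # deque with maxlen silently discards the oldest event when full
        self.events = deque(maxlen=size)

    def add(self, user, item):
        self.events.append((user, item))


class FadingCounts:
    """Down-weight older events by a factor `alpha` in (0, 1) (hypothetical sketch)."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.weights = {}

    def add(self, user, item):
        # Assumption: every stored weight fades once per incoming event.
        for key in self.weights:
            self.weights[key] *= self.alpha
        # The new implicit binary rating contributes a full weight of 1.
        self.weights[(user, item)] = self.weights.get((user, item), 0.0) + 1.0


window = SlidingWindow(size=2)
for user, item in [("u1", "a"), ("u2", "b"), ("u3", "c")]:
    window.add(user, item)

faded = FadingCounts(alpha=0.5)
faded.add("u1", "a")
faded.add("u2", "b")
```

In this sketch the window forgets abruptly (an event is either kept or gone), while the fading factor forgets gradually; both bound the influence of old data, which is the property the abstract associates with lower time and space requirements.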

2012

D-Confidence: An active learning strategy to reduce label disclosure complexity in the presence of imbalanced class distributions

Authors
Escudeiro, NF; Jorge, AM;

Publication
Journal of the Brazilian Computer Society

Abstract
In some classification tasks, such as those related to the automatic building and maintenance of text corpora, it is expensive to obtain labeled instances to train a classifier. In such circumstances it is common to have massive corpora where a few instances are labeled (typically a minority) while others are not. Semi-supervised learning techniques try to leverage the intrinsic information in unlabeled instances to improve classification models. However, these techniques assume that the labeled instances cover all the classes to learn, which might not be the case. Moreover, in the presence of an imbalanced class distribution, getting labeled instances from minority classes might be very costly, requiring extensive labeling, if queries are randomly selected. Active learning allows asking an oracle to label new instances, which are selected by criteria aiming to reduce the labeling effort. D-Confidence is an active learning approach that is effective in the presence of imbalanced training sets. In this paper we evaluate the performance of d-Confidence in comparison to its baseline criteria over tabular and text datasets. We provide empirical evidence that d-Confidence reduces label disclosure complexity (which we have defined as the number of queries required to identify instances from all classes to learn) in the presence of imbalanced data. © 2012 The Brazilian Computer Society.
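
The definition of label disclosure complexity given in the abstract (the number of queries required to identify instances from all classes to learn) can be computed directly from a sequence of oracle answers. The sketch below is only an illustration of that definition; the function name `label_disclosure_complexity` and its arguments are hypothetical and not taken from the paper.

```python
from typing import Hashable, Iterable, Optional, Sequence


def label_disclosure_complexity(
    queried_labels: Sequence[Hashable],
    all_classes: Iterable[Hashable],
) -> Optional[int]:
    """Count the queries needed until every class has been seen at least once.

    `queried_labels` holds the oracle's answers in query order.
    Returns None if some class is never disclosed.
    """
    remaining = set(all_classes)
    if not remaining:
        return 0  # nothing to disclose
    for n_queries, label in enumerate(queried_labels, start=1):
        remaining.discard(label)
        if not remaining:
            return n_queries
    return None


# The minority class "c" only appears at the fifth query.
ldc = label_disclosure_complexity(["a", "a", "b", "a", "c"], {"a", "b", "c"})
```

A query strategy that reaches minority classes earlier yields a smaller value, which is the sense in which the abstract reports a reduction.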

2010

Ensembles of jittered association rule classifiers

Authors
Azevedo, PJ; Jorge, AM;

Publication
Data Mining and Knowledge Discovery

Abstract
The ensembling of classifiers tends to improve predictive accuracy. To obtain an ensemble with N classifiers, one typically needs to run N learning processes. In this paper we introduce and explore Model Jittering Ensembling, where one single model is perturbed in order to obtain variants that can be used as an ensemble. We use as base classifiers sets of classification association rules. The two methods of jittering ensembling we propose are Iterative Reordering Ensembling (IRE) and Post Bagging (PB). Both methods start by learning one rule set over a single run, and then produce multiple rule sets without relearning. Empirical results on 36 data sets are positive and show that both strategies tend to reduce error with respect to the single model association rule classifier. A bias-variance analysis reveals that while both IRE and PB are able to reduce the variance component of the error, IRE is particularly effective in reducing the bias component. We show that Model Jittering Ensembling can represent a very good speed-up w.r.t. multiple model learning ensembling. We also compare Model Jittering with various state of the art classifiers in terms of predictive accuracy and computational efficiency.

2006

Design of an end-to-end method to extract information from tables

Authors
Costa e Silva, A; Jorge, AM; Torgo, L;

Publication
International Journal on Document Analysis and Recognition

Abstract
This paper plans an end-to-end method for extracting information from tables embedded in documents; input format is ASCII, to which any richer format can be converted, preserving all textual and much of the layout information. We start by defining table. Then we describe the steps involved in extracting information from tables and analyse table-related research to place the contribution of different authors, find the paths research is following, and identify issues that are still unsolved. We then analyse current approaches to evaluating table processing algorithms and propose two new metrics for the task of segmenting cells/columns/rows. We proceed to design our own end-to-end method, where there is a higher interaction between different steps; we indicate how back loops in the usual order of the steps can reduce the possibility of errors and contribute to solving previously unsolved problems. Finally, we explore how the actual interpretation of the table not only allows inferring the accuracy of the overall extraction process but also contributes to actually improving its quality. In order to do so, we believe interpretation has to consider context-specific knowledge; we explore how the addition of this knowledge can be made in a plug-in/out manner, such that the overall method will maintain its operability in different contexts.

2012

Finding interesting contexts for explaining deviations in bus trip duration using distribution rules

Authors
Jorge, AM; Mendes Moreira, J; De Sousa, JF; Soares, C; Azevedo, PJ;

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
In this paper we study the deviation of bus trip duration and its causes. Deviations are obtained by comparing scheduled times against actual trip duration and are either delays or early arrivals. We use distribution rules, a kind of association rules that may have continuous distributions on the consequent. Distribution rules allow the systematic identification of particular conditions, which we call contexts, under which the distribution of trip time deviations differs significantly from the overall deviation distribution. After identifying specific causes of delay, the bus company operational managers can make adjustments to the timetables, increasing punctuality without disrupting the service. © Springer-Verlag Berlin Heidelberg 2012.

2012

HCAC: Semi-supervised hierarchical clustering using confidence-based active learning

Authors
Nogueira, BM; Jorge, AM; Rezende, SO;

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
Despite its importance, hierarchical clustering has been little explored for semi-supervised algorithms. In this paper, we address the problem of semi-supervised hierarchical clustering by using an active learning solution with cluster-level constraints. This active learning approach is based on a new concept of merge confidence in agglomerative clustering. When there is low confidence in a cluster merge, the user is queried and provides a cluster-level constraint. The proposed method is compared with an unsupervised algorithm (average-link) and two state-of-the-art semi-supervised algorithms (pairwise constraints and Constrained Complete-Link). Results show that our algorithm tends to be better than the two semi-supervised algorithms and can achieve a significant improvement when compared to the unsupervised algorithm. Our approach is particularly useful when the number of clusters is high, which is the case in many real problems. © 2012 Springer-Verlag Berlin Heidelberg.
