Publications

Publications by Mário João Antunes

2015

Active Manifold Learning with Twitter Big Data

Authors
Silva, C; Antunes, M; Costa, J; Ribeiro, B;

Publication
INNS CONFERENCE ON BIG DATA 2015 PROGRAM

Abstract
The data produced by Internet applications have increased substantially. Big data is a flaring field that deals with this deluge of data by using storage techniques, dedicated infrastructures and development frameworks for the parallelization of defined tasks and their consequent reduction. These solutions, however, fall short in online and highly data-demanding scenarios, since users expect swift feedback. Reduction techniques are used efficiently in big data online applications to improve classification. Reduction in big data usually takes one of two main forms: (i) reducing the dimensionality by pruning or reformulating the feature set; (ii) reducing the sample size by choosing the most relevant examples. Both approaches have benefits, not only in the time needed to build a model, but potentially also in performance, usually by reducing overfitting and improving generalization capabilities. In this paper we investigate reduction techniques that tackle both the dimensionality and the size of big data. We propose a framework that combines a manifold learning approach to reduce dimensionality and an active learning SVM-based strategy to reduce the size of the labeled sample. Results on Twitter data show the potential of the proposed active manifold learning approach.


2015

DOTS: Drift Oriented Tool System

Authors
Costa, J; Silva, C; Antunes, M; Ribeiro, B;

Publication
NEURAL INFORMATION PROCESSING, ICONIP 2015, PT IV

Abstract
Drift is a given in most machine learning applications. The idea that models must accommodate changes, and thus be dynamic, is ubiquitous. Current challenges include temporal data streams, drift and non-stationary scenarios, often with text data, whether in social networks or in business systems. There are multiple types of drift pattern: concepts that appear and disappear suddenly, recurrently, or even gradually or incrementally. Researchers strive to propose and test algorithms and techniques to deal with drift in text classification, but it is difficult to find adequate benchmarks in such dynamic environments. In this paper we present DOTS, Drift Oriented Tool System, a framework that allows for the definition and generation of text-based datasets where drift characteristics can be thoroughly defined, implemented and tested. The usefulness of DOTS is demonstrated using a Twitter stream case study. DOTS is used to define datasets and test the effectiveness of using different document representations in a Twitter scenario. Results show the potential of DOTS in machine learning research.


2017

Automatic Documents Counterfeit Classification Using Image Processing and Analysis

Authors
Vieira, R; Antunes, M; Silva, C; Assis, A;

Publication
PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2017)

Abstract
Counterfeit detection in official documents challenges forensic experts, who try to correlate counterfeits in order to help criminal investigators identify forgery authors. Past counterfeit investigation at the Portuguese Police Forensic Laboratory allowed the construction of an organized set of digital images related to counterfeited documents, helping the manual identification of new counterfeiters' modus operandi. However, these images are usually stored in distinct resolutions, may have different sizes and could have been captured under different types of illumination. In this paper we present a methodology to automate the identification of a counterfeit's modus operandi, by comparing a given document image with a database of previously catalogued counterfeited document images. The proposed method ranks the identified counterfeited documents and allows the forensic experts to direct their attention to the most similar documents. It takes advantage of scalable algorithms under the OpenCV framework that compare images, match patterns and analyse textures and colours. We present a set of tests with distinct datasets with promising results.


2017

Performance Metrics for Model Fusion in Twitter Data Drifts

Authors
Costa, J; Silva, C; Antunes, M; Ribeiro, B;

Publication
PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2017)

Abstract
Ensemble approaches have revealed remarkable abilities to tackle different learning challenges, namely in dynamic scenarios with concept drift, e.g. in social networks such as Twitter. Several efforts have been made to define strategies to combine the models that constitute an ensemble. In this work, we investigate the effect of using different metrics for combining ensembles' models, specifically performance-based metrics. We propose five performance-based combining metrics, having in mind that we may take advantage of diversity in classifiers, as their individual performance takes a leading role in defining their contribution to the ensemble. Experimental results on a Twitter dataset, artificially timestamped, suggest that using performance metrics to combine the models that constitute an ensemble can introduce relevant improvements in the overall ensemble performance.
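
As an illustration of the general idea of performance-based combination, the sketch below weights each model's vote by a performance score. It is not one of the five metrics proposed in the paper, which the abstract does not specify; the function name `weighted_vote` and the accuracy values are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine one predicted label per model, weighting each vote."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # The label with the largest accumulated weight wins.
    return max(totals, key=totals.get)


# Hypothetical validation accuracies for three models, used as vote weights.
accuracies = [0.9, 0.4, 0.4]
predictions = ["pos", "neg", "neg"]

# A plain majority vote would pick "neg"; weighting by performance
# lets the single stronger model prevail.
combined = weighted_vote(predictions, accuracies)
```

With these assumed weights the stronger model outweighs the two weaker ones, which is the kind of effect a performance-based combination aims for.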


2014

Concept Drift Awareness in Twitter Streams

Authors
Costa, J; Silva, C; Antunes, M; Ribeiro, B;

Publication
2014 13TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA)

Abstract
Learning in non-stationary environments is not an easy task and requires a distinctive approach. The learning model must not only have the ability to continuously learn, but also the ability to acquire new concepts and forget the old ones. Additionally, given the significant importance that social networks have gained as information networks, there is an ever-growing interest in the extraction of complex information used for trend detection, promoting services or market sensing. This dynamic nature tends to limit the performance of traditional static learning models, and dynamic learning strategies must be put forward. In this paper we present a learning strategy to learn with drift in the occurrence of concepts in Twitter. We propose three different models: a time-window model, an ensemble-based model and an incremental model. Since little is known about the types of drift that can occur in Twitter, we simulate different types of drift by artificially timestamping real Twitter messages in order to evaluate and validate our strategy. Results are so far encouraging regarding learning in the presence of drift, along with classifying messages in Twitter streams.
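
The time-window idea can be illustrated with a toy learner that remembers only the most recent batches and therefore forgets old concepts. This is a minimal sketch under stated assumptions, not the paper's model: the class `TimeWindowMajority`, the majority-label rule and the example topics are all hypothetical.

```python
from collections import Counter, deque


class TimeWindowMajority:
    """Toy learner that only remembers labels from the last `window` batches."""

    def __init__(self, window):
        # A bounded deque drops the oldest batch automatically when full.
        self.batches = deque(maxlen=window)

    def update(self, labels):
        self.batches.append(list(labels))

    def predict(self):
        # Predict the most frequent label inside the current window.
        counts = Counter(label for batch in self.batches for label in batch)
        return counts.most_common(1)[0][0]


model = TimeWindowMajority(window=2)
model.update(["sports", "sports", "sports"])
model.update(["politics"])
before_drift = model.predict()  # "sports" still dominates the window

model.update(["politics", "politics"])  # the oldest batch is forgotten
latest = model.predict()
```

After the third batch arrives, the first one leaves the window, so the prediction follows the newer concept.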


2013

CrowdTargeting: Making Crowds More Personal

Authors
Costa, J; Silva, C; Ribeiro, B; Antunes, M;

Publication
2013 8TH INTERNATIONAL WORKSHOP ON SEMANTIC AND SOCIAL MEDIA ADAPTATION AND PERSONALIZATION (SMAP 2013)

Abstract
Crowdsourcing is a bubbling research topic that has the potential to be applied in numerous online and social scenarios. It consists of obtaining services or information by soliciting contributions from a large group of people. However, the question of defining the appropriate scope of a crowd to tackle each scenario is still open. In this work we compare two approaches to define the scope of a crowd in a classification problem, cast as a recommendation system. We propose a similarity measure to determine the closeness of a specific user to each crowd contributor and hence to define the appropriate crowd scope. We compare different levels of customization using crowd-based information, allowing non-expert classification by crowds to be tuned to substitute the user profile definition. Results on a real recommendation data set show the potential of making crowds more personal, i.e. of tuning the crowd to the crowdtarget.
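
To make the notion of scoping a crowd by similarity concrete, the sketch below keeps only contributors whose ratings are close to the target user's. The abstract does not state which similarity measure the paper proposes, so cosine similarity over shared items is an assumption here, as are the names `select_crowd`, the threshold and the sample ratings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity over the items rated by both users (0.0 if none)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_crowd(user, contributors, threshold):
    """Return the names of contributors similar enough to the user."""
    return sorted(
        name
        for name, ratings in contributors.items()
        if cosine_similarity(user, ratings) >= threshold
    )


# Hypothetical ratings keyed by item identifier.
user = {"item1": 5, "item2": 1}
contributors = {
    "ana": {"item1": 5, "item2": 1, "item3": 4},
    "rui": {"item1": 1, "item2": 5},
    "eva": {"item3": 2},
}
crowd = select_crowd(user, contributors, threshold=0.9)
```

Raising or lowering the assumed threshold widens or narrows the crowd, which is one simple way to vary the level of customization.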
