2017
Authors
Saleiro, P; Sarmento, L; Rodrigues, EM; Soares, C; Oliveira, E;
Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2017)
Abstract
This paper describes a preliminary study for producing and distributing a large-scale database of embeddings from the Portuguese Twitter stream. We start by experimenting with a relatively small sample and focusing on three challenges: volume of training data, vocabulary size and intrinsic evaluation metrics. Using a single GPU, we were able to scale up vocabulary size from 2048 words embedded and 500K training examples to 32768 words over 10M training examples while keeping a stable validation loss and an approximately linear trend in training time per epoch. We also observed that using fewer than 50% of the available training examples for each vocabulary size might result in overfitting. Results on intrinsic evaluation show promising performance for a vocabulary size of 32768 words. Nevertheless, intrinsic evaluation metrics suffer from over-sensitivity to their corresponding cosine similarity thresholds, indicating that a wider range of metrics needs to be developed to track progress.
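The threshold sensitivity mentioned at the end of this abstract can be illustrated with a minimal sketch. It is not the paper's evaluation code: the two-dimensional vectors, the word pairs, and the `hit_rate` metric are all hypothetical, chosen only to show how a score that counts pairs above a cosine similarity threshold can swing as the threshold moves.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-d embeddings (made-up values, not taken from the paper).
embeddings = {
    "porto": [1.0, 0.0],
    "lisboa": [0.8, 0.6],
    "braga": [0.6, 0.8],
}

# Hypothetical gold standard: pairs that are assumed to be related.
related_pairs = [("porto", "lisboa"), ("porto", "braga"), ("lisboa", "braga")]

def hit_rate(pairs, threshold):
    """Fraction of related pairs whose cosine similarity reaches the threshold."""
    hits = sum(cosine(embeddings[a], embeddings[b]) >= threshold for a, b in pairs)
    return hits / len(pairs)

# The same embeddings look very different depending on the threshold chosen.
for threshold in (0.5, 0.7, 0.9):
    print(threshold, round(hit_rate(related_pairs, threshold), 2))
```

In this toy setup the score drops from all pairs recovered to one in three as the threshold rises from 0.5 to 0.9, even though the embeddings never change.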
2017
Authors
das Dôres, SN; Soares, C; Ruiz, DDA;
Publication
Proceedings of the International Workshop on Automatic Selection, Configuration and Composition of Machine Learning Algorithms co-located with the European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases, AutoML@PKDD/ECML 2017, Skopje, Macedonia, September 22, 2017.
Abstract
Feature selection (FS) is important to improve learning performance, reduce computational complexity and decrease required storage. There are multiple methods for feature selection, with varying impact and computational cost. Therefore, choosing the right method for a given data set is important. In this paper, we analyze the advantages of using metalearning to decide whether to employ feature selection. This issue is relevant because a wrong decision may result in additional processing, when FS is unnecessarily applied, or in a loss of performance, when FS is not used in a problem for which it is appropriate. Our results show that, although there is an advantage in using metalearning, the gains are not yet sufficiently relevant, which opens the way for new research in the area.
2017
Authors
Brazdil, P; Vilalta, R; Giraud Carrier, CG; Soares, C;
Publication
Encyclopedia of Machine Learning and Data Mining
Abstract
In the area of machine learning / data mining, many diverse algorithms are available nowadays, and hence the selection of the most suitable algorithm may be a challenge. This is aggravated by the fact that many algorithms require that certain parameters be set. If a wrong algorithm and/or parameter configuration is selected, substandard results may be obtained. The topic of metalearning aims to facilitate this task. Metalearning typically proceeds in two phases. First, a given set of algorithms A (e.g. classification algorithms) and datasets D is identified, and different pairs <ai, dj> from these two sets are chosen for testing. The dataset dj is described by certain meta-features, which together with the performance result of algorithm ai constitute a part of the metadata. In the second phase, the metadata is used to construct a model, usually again with recourse to machine learning methods. The model represents a generalization of various base-level experiments. The model can then be applied to a new dataset to recommend the most suitable algorithm or a ranking ordered by relative performance. This article provides more details about this area. It also discusses how the method can be combined with hyperparameter optimization and extended to sequences of operations (workflows). © Springer Science+Business Media New York 2011, 2017
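The two phases described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the meta-features, the algorithm names, the accuracy values, and the nearest-neighbour meta-model in `recommend_ranking` are hypothetical and stand in for whatever the encyclopedia entry actually covers.

```python
from math import dist

# Phase 1 (assumed already done): each row pairs the meta-features of a
# dataset dj with the measured performance of every algorithm ai on it.
# All numbers below are made up for illustration.
metadata = [
    # ([log10(n_examples), log10(n_features)], {algorithm: accuracy})
    ([2.0, 1.0], {"knn": 0.81, "tree": 0.77, "nb": 0.70}),
    ([4.0, 1.5], {"knn": 0.74, "tree": 0.88, "nb": 0.79}),
    ([3.0, 3.0], {"knn": 0.60, "tree": 0.71, "nb": 0.83}),
]

def recommend_ranking(new_meta_features, metadata, k=1):
    """Phase 2: a k-nearest-neighbour meta-model.

    Finds the k datasets whose meta-features are closest to the new
    dataset and ranks algorithms by their mean performance on them.
    """
    neighbours = sorted(
        metadata, key=lambda row: dist(row[0], new_meta_features)
    )[:k]
    algorithms = neighbours[0][1].keys()
    mean_perf = {
        a: sum(perf[a] for _, perf in neighbours) / len(neighbours)
        for a in algorithms
    }
    # Best-performing algorithm first.
    return sorted(mean_perf, key=mean_perf.get, reverse=True)

# Recommend a ranking for a new dataset described only by its meta-features.
ranking = recommend_ranking([3.9, 1.4], metadata)
print(ranking)
```

A nearest-neighbour meta-model is used here only because it is short to write; any learner that maps meta-features to performance or to a ranking could take its place.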
2017
Authors
Saleiro, P; Frayling, NM; Rodrigues, EM; Soares, C;
Publication
Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017
Abstract
Improvements of entity-relationship (E-R) search techniques have been hampered by a lack of test collections, particularly for complex queries involving multiple entities and relationships. In this paper we describe a method for generating E-R test queries to support comprehensive E-R search experiments. Queries and relevance judgments are created from content that exists in a tabular form where columns represent entity types and the table structure implies one or more relationships among the entities. Editorial work involves creating natural language queries based on relationships represented by the entries in the table. We have publicly released the RELink test collection comprising 600 queries and relevance judgments obtained from a sample of Wikipedia List-of-lists-of-lists tables. The latter comprise tuples of entities that are extracted from columns and labelled by corresponding entity types and relationships they represent. In order to facilitate research in complex E-R retrieval, we have created and released as open source the RELink Framework that includes Apache Lucene indexing and search specifically tailored to E-R retrieval. RELink includes entity and relationship indexing based on the ClueWeb-09-B Web collection with FACC1 text span annotations linked to Wikipedia entities. With ready-to-use search resources and a comprehensive test collection, we support the community in pursuing E-R research at scale. © 2017 ACM.
2017
Authors
Cunha, T; Soares, C; de Carvalho, ACPLF;
Publication
DISCOVERY SCIENCE, DS 2017
Abstract
Recommender Systems have become increasingly popular, propelling the emergence of several algorithms. As the number of algorithms grows, the selection of the most suitable algorithm for a new task becomes more complex. The development of new Recommender Systems would benefit from tools to support the selection of the most suitable algorithm. Metalearning has been used for similar purposes in other tasks, such as classification and regression. It learns predictive models that map characteristics of a dataset to the predictive performance obtained by a set of algorithms. To this end, different types of characteristics have been proposed: statistical and/or information-theoretical, model-based and landmarkers. Recent studies argue that landmarkers are successful in selecting algorithms for different tasks. We propose a set of landmarkers for a Metalearning approach to the selection of Collaborative Filtering algorithms. The performance is compared with a state-of-the-art systematic metafeatures approach using statistical and/or information-theoretical metafeatures. The results show that the metalevel accuracy obtained using landmarkers is not statistically significantly better than that of the metafeatures obtained with a more traditional approach. Furthermore, the baselevel results obtained with the algorithms recommended using landmarkers are worse than the ones obtained with the other metafeatures. In summary, our results show that, contrary to the results obtained in other tasks, these landmarkers are not necessarily the best metafeatures for algorithm selection in Collaborative Filtering.
2017
Authors
Saleiro, P; Rodrigues, EM; Soares, C; Oliveira, EC;
Publication
Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017
Abstract