Publications - INESC TEC

Publications by Leonardo Gomes Capozzi

2021

Optimizing Person Re-Identification Using Generated Attention Masks

Authors
Capozzi, L; Pinto, JR; Cardoso, JS; Rebelo, A

Publication
CIARP

Abstract
The task of person re-identification has important applications in security and surveillance systems. It is a challenging problem since there can be many differences between pictures belonging to the same person, such as lighting, camera position, variation in poses and occlusions. The use of Deep Learning has contributed greatly towards more effective and accurate systems. Many works use attention mechanisms to force the models to focus on less distinctive areas, in order to improve performance in situations where important information may be missing. This paper proposes a new, more flexible method for calculating these masks, using a U-Net which receives a picture and outputs a mask representing the most distinctive areas of the picture. Results show that the method achieves an accuracy comparable or superior to that of state-of-the-art methods.

2021

End-to-End Deep Sketch-to-Photo Matching Enforcing Realistic Photo Generation

Authors
Capozzi, L; Pinto, JR; Cardoso, JS; Rebelo, A

Publication
CIARP

Abstract
The traditional task of locating suspects using forensic sketches posted in public spaces, news, and social media can be difficult. Recent methods that use computer vision to improve this process present limitations, as they do not use end-to-end networks for sketch recognition in police databases (which generally improve performance) and/or do not offer a photo-realistic representation of the sketch that could be used as an alternative if the automatic matching process fails. This paper proposes a method that combines these two properties, using a conditional generative adversarial network (cGAN) and a pre-trained face recognition network that are jointly optimised as an end-to-end model. While the model can identify a short list of potential suspects in a given database, the cGAN offers an intermediate realistic face representation to support an alternative manual matching process. Evaluation on sketch-photo pairs from the CUFS, CUFSF and CelebA databases reveals that the proposed method outperforms the state of the art in most tasks, and that forcing an intermediate photo-realistic representation only results in a small performance decrease.

2022

Streamlining Action Recognition in Autonomous Shared Vehicles with an Audiovisual Cascade Strategy

Authors
Pinto, JR; Carvalho, P; Pinto, C; Sousa, A; Capozzi, L; Cardoso, JS

Publication
PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 5

Abstract
With the advent of self-driving cars, and big companies such as Waymo or Bosch pushing forward into fully driverless transportation services, the in-vehicle behaviour of passengers must be monitored to ensure safety and comfort. The use of audio-visual information is attractive for its spatio-temporal richness as well as its non-invasive nature, but faces the likely constraints posed by available hardware and energy consumption. Hence, new strategies are required to improve the usage of these scarce resources. We propose the processing of audio and visual data in a cascade pipeline for in-vehicle action recognition. The data is processed by modality-specific sub-modules, with subsequent ones being used when a confident classification is not reached. Experiments show an interesting accuracy-acceleration trade-off when compared with a parallel pipeline with late fusion, presenting potential for industrial applications on embedded devices.
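
The cascade idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cascade_predict`, the confidence threshold of 0.8, the fallback to the last stage, and the toy audio and video stages are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: a sub-module maps a sample to per-class probabilities.
Stage = Callable[[Dict[str, Sequence[float]]], Sequence[float]]


def cascade_predict(
    sample: Dict[str, Sequence[float]],
    stages: List[Stage],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> Tuple[int, float, int]:
    """Run stages in order and stop at the first confident prediction.

    Returns (predicted class index, confidence, number of stages used).
    If no stage is confident, the last stage's prediction is returned.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    label, confidence, used = -1, 0.0, 0
    for used, stage in enumerate(stages, start=1):
        probs = list(stage(sample))
        confidence = max(probs)
        label = probs.index(confidence)
        if confidence >= threshold:
            break  # confident enough: skip the remaining (costlier) stages
    return label, confidence, used


# Toy stand-ins for modality-specific sub-modules (assumptions for the demo):
# each one simply reads precomputed probabilities from the sample.
def audio_stage(sample: Dict[str, Sequence[float]]) -> Sequence[float]:
    return sample["audio"]


def video_stage(sample: Dict[str, Sequence[float]]) -> Sequence[float]:
    return sample["video"]


easy = {"audio": [0.9, 0.1], "video": [0.2, 0.8]}
hard = {"audio": [0.55, 0.45], "video": [0.1, 0.9]}
print(cascade_predict(easy, [audio_stage, video_stage]))
print(cascade_predict(hard, [audio_stage, video_stage]))
```

In this sketch the second stage runs only for the `hard` sample, which is where the compute savings over a parallel late-fusion pipeline would come from.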

2022

Toward Vehicle Occupant-Invariant Models for Activity Characterization

Authors
Capozzi, L; Barbosa, V; Pinto, C; Pinto, JR; Pereira, A; Carvalho, PM; Cardoso, JS

Publication
IEEE ACCESS

Abstract
With the advent of self-driving cars and the push by large companies into fully driverless transportation services, monitoring passenger behaviour in vehicles is becoming increasingly important for several reasons, such as ensuring safety and comfort. Although several human action recognition (HAR) methods have been proposed, developing a true HAR system remains a very challenging task. If the dataset used to train a model contains a small number of actors, the model can become biased towards these actors and their unique characteristics. This can cause the model to generalise poorly when confronted with new actors performing the same actions. This limitation is particularly acute when developing models to characterise the activities of vehicle occupants, for which data sets are short and scarce. In this study, we describe and evaluate three different methods that aim to address this actor bias and assess their performance in detecting in-vehicle violence. These methods work by removing specific information about the actor from the model's features during training or by using data that is independent of the actor, such as information about body posture. The experimental results show improvements over the baseline model when evaluated with real data. On the Hanau03 Vito dataset, the accuracy improved from 65.33% to 69.41%. On the Sunnyvale dataset, the accuracy improved from 82.81% to 86.62%.

2025

End-to-End Occluded Person Re-Identification With Artificial Occlusion Generation

Authors
Capozzi, L; Cardoso, JS; Rebelo, A

Publication
IEEE ACCESS

Abstract
In recent years, the task of person re-identification (Re-ID) has improved considerably with the advances in deep learning methodologies. However, occluded person Re-ID remains a challenging task, as parts of the body of the individual are frequently hidden by various objects, obstacles, or other people, making the identification process more difficult. To address these issues, we introduce a novel data augmentation strategy using artificial occlusions, consisting of random shapes and objects from a small image dataset that was created. We also propose an end-to-end methodology for occluded person Re-ID, which consists of three branches: a global branch, a feature dropping branch, and an occlusion detection branch. Experimental results show that the use of random shape occlusions is superior to random erasing using our architecture. Results on six datasets consisting of three tasks (holistic, partial and occluded person Re-ID) demonstrate that our method performs favourably against state-of-the-art methodologies.
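
As a rough illustration of the artificial-occlusion augmentation mentioned in this abstract, the sketch below pastes one random shape onto a grayscale image stored as nested lists. It is only a sketch under stated assumptions: the function name `occlude_random_shape`, the choice between a rectangle and an ellipse, the `max_fraction` size limit and the constant fill value are hypothetical, and the object-based occlusions from the authors' image dataset are not modelled.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values


def occlude_random_shape(
    image: Image,
    rng: random.Random,
    max_fraction: float = 0.5,  # assumed upper bound on occluder size
    fill: int = 0,              # assumed constant fill value
) -> Image:
    """Return a copy of `image` with one random rectangle or ellipse filled in."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input untouched

    # Random size and position of the occluder's bounding box.
    box_h = rng.randint(1, max(1, int(height * max_fraction)))
    box_w = rng.randint(1, max(1, int(width * max_fraction)))
    top = rng.randint(0, height - box_h)
    left = rng.randint(0, width - box_w)
    shape = rng.choice(["rectangle", "ellipse"])

    # Ellipse inscribed in the bounding box (centre and radii in pixels).
    cy, cx = top + (box_h - 1) / 2, left + (box_w - 1) / 2
    ry, rx = box_h / 2, box_w / 2

    for y in range(top, top + box_h):
        for x in range(left, left + box_w):
            inside_ellipse = ((y - cy) / ry) ** 2 + ((x - cx) / rx) ** 2 <= 1.0
            if shape == "rectangle" or inside_ellipse:
                out[y][x] = fill
    return out


clean = [[255] * 8 for _ in range(8)]
augmented = occlude_random_shape(clean, random.Random(0))
for row in augmented:
    print("".join("#" if value == 0 else "." for value in row))
```

Unlike plain random erasing, which only removes rectangles, a shape-based occluder of this kind can vary in outline, which is the property the abstract reports as beneficial for its architecture.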

2026

Deciphering the Silent Signals: Unveiling Frequency Importance for Wi-Fi-Based Human Pose Estimation with Explainability

Authors
Capozzi, L; Ferreira, L; Gonçalves, T; Rebelo, A; Cardoso, JS; Sequeira, AF

Publication
PATTERN RECOGNITION AND IMAGE ANALYSIS, IBPRIA 2025, PT II

Abstract
The rapid advancement of wireless technologies, particularly Wi-Fi, has spurred significant research into indoor human activity detection across various domains (e.g., healthcare, security, and industry). This work explores the non-invasive and cost-effective Wi-Fi paradigm and the application of deep learning for human activity recognition using Wi-Fi signals. Focusing on the challenges in machine interpretability, motivated by the increase in data availability and computational power, this paper uses explainable artificial intelligence to understand the inner workings of transformer-based deep neural networks designed to estimate human pose (i.e., human skeleton key points) from Wi-Fi channel state information. Using different strategies to assess the most relevant sub-carriers (i.e., rollout attention and masking attention) for the model predictions, we evaluate the performance of the model when it uses a given number of sub-carriers as input, selected randomly or by ascending (high-attention) or descending (low-attention) order. We concluded that the models trained with fewer (but relevant) sub-carriers are competitive with the baseline (trained with all sub-carriers) but better in terms of computational efficiency (i.e., processing more data per second).
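
The sub-carrier selection step described in this abstract can be sketched as follows. This is an illustrative assumption rather than the paper's code: the per-sub-carrier relevance scores are taken as given (the abstract derives them from rollout or masking attention), and the names `select_subcarriers` and `keep_columns` are hypothetical.

```python
import random
from typing import List, Sequence


def select_subcarriers(
    scores: Sequence[float], k: int, strategy: str = "high", seed: int = 0
) -> List[int]:
    """Pick k sub-carrier indices by relevance score.

    strategy: "high" keeps the k highest-scoring sub-carriers,
              "low" keeps the k lowest-scoring ones,
              "random" draws k indices uniformly (seeded for repeatability).
    """
    n = len(scores)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of sub-carriers")
    if strategy == "random":
        chosen = random.Random(seed).sample(range(n), k)
    elif strategy in ("high", "low"):
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=(strategy == "high"))
        chosen = ranked[:k]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(chosen)  # keep the original sub-carrier order


def keep_columns(csi: Sequence[Sequence[float]], indices: Sequence[int]) -> List[List[float]]:
    """Reduce a (time x sub-carrier) CSI matrix to the selected sub-carriers."""
    return [[row[i] for i in indices] for row in csi]


attention = [0.1, 0.5, 0.2, 0.9]    # made-up relevance scores
csi = [[1, 2, 3, 4], [5, 6, 7, 8]]  # made-up CSI frames
top = select_subcarriers(attention, k=2, strategy="high")
print(top, keep_columns(csi, top))
```

A model trained on the reduced input sees fewer features per frame, which is consistent with the throughput gain the abstract reports for models trained with fewer but relevant sub-carriers.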
