Publications - INESC TEC

Publications

Publications by CTM

2025

A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning

Authors
Vilaça, L; Yu, Y; Viana, P;

Publication
ACM COMPUTING SURVEYS

Abstract
Audio-visual correlation learning aims to capture and understand natural phenomena between audio and visual data. The rapid growth of Deep Learning has propelled the development of methods that process audio-visual data, as reflected in the number of proposals in recent years, which motivates a comprehensive survey. Besides analyzing the models used in this context, we also discuss task definitions and paradigms applied in AI multimedia. In addition, we investigate frequently used objective functions and discuss how audio-visual data is exploited in the optimization process, i.e., the different methodologies for representing knowledge in the audio-visual domain. In particular, we focus on how human-understandable mechanisms, i.e., structured knowledge that reflects comprehensible knowledge, can guide the learning process. Most importantly, we summarize the recent progress of Audio-Visual Correlation Learning (AVCL) and discuss future research directions.


2025

Correction to: A Review of Recent Advances and Challenges in Grocery Label Detection and Recognition (Applied Sciences, (2023), 13, 5, (2871), 10.3390/app13052871)

Authors
Guimarães, V; Nascimento, J; Viana, P; Carvalho, P;

Publication
Applied Sciences (Switzerland)

Abstract
There was an error in the original publication [1]. The statement in the Acknowledgments section is incorrect and should be removed because the official start of the project WATSON was after the paper’s publication date. The authors state that the scientific conclusions are unaffected. This correction was approved by the Academic Editor. The original publication has also been updated. © 2025 by the authors.


2025

Dialogue-AV: A Dialogue-Attended Audiovisual Dataset

Authors
Vilaça, L; Viana, P; Yu, Y;

Publication
CBMI

Abstract
This work introduces Dialogue-AV, a benchmarking dataset for Audio-Video-Language (AVL). We propose using dialogue to describe video content instead of single captions, capturing nuances and shared meanings between audio and visual elements. This approach contributes significantly to improving the diversity of video descriptions and enables comprehensive evaluation of AVL learning across different downstream tasks, such as Cross-Modal Retrieval, Visual Question-Answering, and Video Captioning. Our dataset comprises approximately 258k audiovisual samples accompanied by dialogue-based descriptions for benchmarking. Dialogue-AV builds upon existing State-of-the-Art (SOTA) datasets that feature human-generated descriptions, enhancing them with model-generated ones that describe all modalities. We also present zero-shot baseline results utilising SOTA Visual-Language Models (VLMs), demonstrating that Dialogue-AV is capable of benchmarking a variety of downstream tasks with diverse inputs. Our key contributions include: 1) Dialogue-AV, a benchmark dataset for dialogue-based AVL models; and 2) benchmarks that expose the limitations of current SOTA VLMs. The code and dataset are accessible at: github.com/lvilaca16/dialogue-av. © 2025 IEEE.


2025

Is it Enough to Ask Questions? Dialogue Evaluation through Question Answering and Generation

Authors
Luís Vilaça; Paula Viana;

Publication
Proceedings of the 2nd ACM Workshop in AI-powered Question & Answering Systems

Abstract

2025

An Assessment of the Sensory Function in the Maxillofacial Region: A Dual-Case Pilot Study

Authors
Aguiar, JM; da Silva, JM; Fonseca, C; Marinho, J;

Publication
SENSORS

Abstract
Trigeminal somatosensory-evoked potentials (TSEPs) provide valuable insight into neural responses to oral stimuli. This study investigates TSEP recording methods and their impact on interpreting results in clinical settings to improve the development process of neurostimulation-based therapies. The experiments and results presented here aim at identifying appropriate stimulation characteristics to design an active dental prosthesis capable of contributing to restoring the lost neurosensitive connection between the teeth and the brain. Two methods of TSEP acquisition, traditional and occluded, were used, each performed with a different volunteer. Traditional TSEP acquisition involves stimulation at different sites with varying parameters to achieve a control base. In contrast, occluded TSEPs examine responses acquired under low- and high-force bite conditions to assess the influence of periodontal mechanoreceptors and muscle activation on measurements. Traditional TSEPs demonstrated methodological feasibility with satisfactory results despite a limited subject pool. However, occluded TSEPs presented challenges in interpreting results, with responses deviating from expected norms, particularly under high-force conditions, due to the simultaneous occurrence of stimulation and dental occlusion. While traditional TSEPs highlight methodological feasibility, the occluded approach highlights complexities in outcome interpretation and urges caution in clinical application. Previously unreported results were achieved, which underscores the importance of conducting further research with larger sample sizes and refined protocols in order to strengthen the reliability and validity of TSEP assessments.


2025

A Reinforcement Learning Based Recommender System Framework for Web Apps: Radio and Game Aggregators Scenarios

Authors
Batista, A; Torres, JM; Sobral, P; Moreira, RS; Soares, C; Pereira, I;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2024, PT I

Abstract
Recommendation systems can play an important role in today's digital content platforms by supporting the suggestion of relevant content in a personalised manner for each customer. Such content customisation has not been consistent across most media domains, particularly in radio streaming and gaming aggregators, which are the two real-world application domains addressed in this work. The challenges faced in these application areas are the dynamic nature of user preferences and the difficulty of generating recommendations for less popular content, due to the overwhelming choice and polarisation of available top content. We present the design and implementation of a Reinforcement Learning-based Recommendation System (RLRS) for web applications, using a Deep Deterministic Policy Gradient (DDPG) agent and, as a reward function, a weighted sum of the user Click Distribution (CD) across the recommended items and the Dwell Time (DT), a measure of the time users spend interacting with those items. Our system has been deployed in real production scenarios with preliminary but promising results. Several metrics are used to track the effectiveness of our approach, such as content coverage, category diversity, and intra-list similarity. In both scenarios tested, the system shows consistent improvement and adaptability over time, reinforcing its applicability.
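
The reward described in the abstract above, a weighted sum of Click Distribution (CD) and Dwell Time (DT), could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not say how CD and DT are quantified or weighted, so the scoring functions `click_score` and `dwell_score`, the weights `w_cd` and `w_dt`, and the normalisation cap `cap_seconds` are all hypothetical assumptions.

```python
from typing import Sequence


def click_score(clicks: Sequence[int]) -> float:
    """ASSUMPTION: CD is scored as the fraction of recommended items clicked at least once."""
    if not clicks:
        return 0.0
    return sum(1 for c in clicks if c > 0) / len(clicks)


def dwell_score(dwell_seconds: Sequence[float], cap_seconds: float = 300.0) -> float:
    """ASSUMPTION: DT is scored as mean dwell time per item, capped and scaled to [0, 1]."""
    if not dwell_seconds:
        return 0.0
    capped = sum(min(d, cap_seconds) for d in dwell_seconds)
    return capped / (cap_seconds * len(dwell_seconds))


def reward(
    clicks: Sequence[int],
    dwell_seconds: Sequence[float],
    w_cd: float = 0.5,   # hypothetical weight for the click term
    w_dt: float = 0.5,   # hypothetical weight for the dwell-time term
    cap_seconds: float = 300.0,
) -> float:
    """Weighted sum of the click term and the dwell-time term."""
    return w_cd * click_score(clicks) + w_dt * dwell_score(dwell_seconds, cap_seconds)


if __name__ == "__main__":
    # Four recommended items: two were clicked, with 150 s and 300 s of dwell time.
    print(reward([1, 0, 0, 2], [150.0, 0.0, 0.0, 300.0]))
```

Under these assumptions, keeping both terms in [0, 1] means the weights directly express the relative importance of clicks versus engagement time; a DDPG agent would receive this scalar as its per-step reward.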
