
Publications by Paula Viana

2024

VEMOCLAP: A video emotion classification web application

Authors
Sulun, S; Viana, P; Davies, MEP;

Publication
IEEE International Symposium on Multimedia, ISM 2024, Tokyo, Japan, December 11-13, 2024

Abstract
We introduce VEMOCLAP: Video EMOtion Classifier using Pretrained features, the first readily available and open-source web application that analyzes the emotional content of any user-provided video. We improve our previous work, which exploits open-source pretrained models that work on video frames and audio, and then efficiently fuse the resulting pretrained features using multi-head cross-attention. Our approach increases the state-of-the-art classification accuracy on the Ekman-6 video emotion dataset by 4.3% and offers an online application for users to run our model on their own videos or YouTube videos. We invite the readers to try our application at serkansulun.com/app.
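As a rough illustration of the fusion step described in this abstract, the sketch below shows how pretrained video-frame and audio features could be combined with multi-head cross-attention for emotion classification. This is not the authors' released code; the feature dimensions, number of heads, and projection sizes are assumed purely for illustration (only the six Ekman-6 classes come from the abstract).

# Illustrative sketch, not the VEMOCLAP implementation: fusing pretrained visual
# and audio features with multi-head cross-attention. Dimensions are assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, visual_dim=768, audio_dim=512, model_dim=256,
                 num_heads=4, num_classes=6):  # Ekman-6 has six emotion classes
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, model_dim)
        self.audio_proj = nn.Linear(audio_dim, model_dim)
        # Visual tokens act as queries; audio tokens provide keys and values
        self.cross_attn = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(model_dim, num_classes)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, n_frames, visual_dim); audio_feats: (batch, n_clips, audio_dim)
        q = self.visual_proj(visual_feats)
        kv = self.audio_proj(audio_feats)
        fused, _ = self.cross_attn(q, kv, kv)   # cross-modal attention
        pooled = fused.mean(dim=1)              # average over time
        return self.classifier(pooled)          # emotion logits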

2025

A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning

Authors
Vilaça, L; Yu, Y; Viana, P;

Publication
ACM COMPUTING SURVEYS

Abstract
Audio-visual correlation learning aims at capturing and understanding natural phenomena between audio and visual data. The rapid growth of Deep Learning has propelled the development of proposals that process audio-visual data, as reflected in the number of proposals published in recent years, which motivates a comprehensive survey. Besides analyzing the models used in this context, we also discuss the task definitions and paradigms applied in AI multimedia. In addition, we investigate the objective functions frequently used and discuss how audio-visual data are exploited in the optimization process, i.e., the different methodologies for representing knowledge in the audio-visual domain. In particular, we focus on how human-understandable mechanisms, i.e., structured knowledge that reflects comprehensible knowledge, can guide the learning process. Most importantly, we summarize the recent progress of Audio-Visual Correlation Learning (AVCL) and discuss future research directions.

2025

Correction to: A Review of Recent Advances and Challenges in Grocery Label Detection and Recognition (Applied Sciences, (2023), 13, 5, (2871), 10.3390/app13052871)

Authors
Guimarães, V; Nascimento, J; Viana, P; Carvalho, P;

Publication
Applied Sciences (Switzerland)

Abstract
There was an error in the original publication [1]. The statement in the Acknowledgments section is incorrect and should be removed because the official start of the project WATSON was after the paper’s publication date. The authors state that the scientific conclusions are unaffected. This correction was approved by the Academic Editor. The original publication has also been updated. © 2025 by the authors.

2024

Enhancing Indoor Localisation: a Bluetooth Low Energy (BLE) Beacon Placement approach

Authors
Dias, J; Oliper, D; Soares, MR; Viana, P;

Publication
2024 IEEE 22ND MEDITERRANEAN ELECTROTECHNICAL CONFERENCE, MELECON 2024

Abstract
This paper addresses the critical challenge of optimising beacon placement to support indoor location services and proposes a methodology to optimise Base Station (BS) coverage while keeping or even improving system precision. The algorithm builds on the building schematics and takes into account several aspects that affect the radio link range (material attenuation, Line of Sight (LOS) conditions, transmitted power, and receiver sensitivity). The outcome is provided as a coverage heat map, which is then compared with a standard layout following existing expert guidelines to evaluate the efficacy of the proposed layout.
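To illustrate the kind of coverage heat map described in this abstract, the sketch below estimates received signal strength over a floor-plan grid using a log-distance path-loss model with per-wall attenuation. This is not the paper's algorithm; the TX power, path-loss exponent, wall loss, and receiver sensitivity values are assumptions for illustration.

# Illustrative sketch, not the paper's method: BLE coverage over a grid of
# floor-plan points using log-distance path loss plus wall attenuation.
import numpy as np

def coverage_heatmap(grid_xy, beacons_xy, walls_crossed, tx_power_dbm=0.0,
                     n=2.2, pl_d0_db=40.0, wall_loss_db=5.0, sensitivity_dbm=-90.0):
    """grid_xy: (P, 2) points; beacons_xy: (B, 2); walls_crossed: (P, B) wall counts."""
    d = np.linalg.norm(grid_xy[:, None, :] - beacons_xy[None, :, :], axis=-1)
    d = np.maximum(d, 0.1)  # avoid log(0) at a beacon's own position
    # Received power = TX power - path loss at 1 m - distance term - loss of crossed walls
    rssi = tx_power_dbm - pl_d0_db - 10.0 * n * np.log10(d) - wall_loss_db * walls_crossed
    best = rssi.max(axis=1)               # strongest beacon at each grid point
    covered = best >= sensitivity_dbm     # covered if above receiver sensitivity
    return best, covered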

2024

CONVERGE: A Vision-Radio Research Infrastructure Towards 6G and Beyond

Authors
Teixeira, FB; Ricardo, M; Coelho, A; Oliveira, HP; Viana, P; Paulino, N; Fontes, H; Marques, P; Campos, R; Pessoa, LM;

Publication
2024 JOINT EUROPEAN CONFERENCE ON NETWORKS AND COMMUNICATIONS & 6G SUMMIT, EUCNC/6G SUMMIT 2024

Abstract
Telecommunications and computer vision have evolved separately so far. Yet, with the shift to sub-terahertz (sub-THz) and terahertz (THz) radio communications, there is an opportunity to explore computer vision technologies together with radio communications, considering the dependency of both technologies on Line of Sight. The combination of radio sensing and computer vision can address challenges such as obstructions and poor lighting. Also, machine learning algorithms, capable of processing multimodal data, play a crucial role in deriving insights from raw and low-level sensing data, offering a new level of abstraction that can enhance various applications and use cases such as beamforming and terminal handovers. This paper introduces CONVERGE, a pioneering vision-radio paradigm that bridges this gap by leveraging Integrated Sensing and Communication (ISAC) to facilitate a dual View-to-Communicate, Communicate-to-View approach. CONVERGE offers tools that merge wireless communications and computer vision, establishing a novel Research Infrastructure (RI) that will be open to the scientific community and capable of providing open datasets. This new infrastructure will support future research in 6G and beyond concerning multiple verticals, such as telecommunications, automotive, manufacturing, media, and health.

2024

Movie trailer genre classification using multimodal pretrained features

Authors
Sulun, S; Viana, P; Davies, MEP;

Publication
EXPERT SYSTEMS WITH APPLICATIONS

Abstract
We introduce a novel method for movie genre classification, capitalizing on a diverse set of readily accessible pretrained models. These models extract high-level features related to visual scenery, objects, characters, text, speech, music, and audio effects. To intelligently fuse these pretrained features, we train small classifier models with low time and memory requirements. Employing the transformer model, our approach utilizes all video and audio frames of movie trailers without performing any temporal pooling, efficiently exploiting the correspondence between all elements, as opposed to the fixed and low number of frames typically used by traditional methods. Our approach fuses features originating from different tasks and modalities, with different dimensionalities, different temporal lengths, and complex dependencies as opposed to current approaches. Our method outperforms state-of-the-art movie genre classification models in terms of precision, recall, and mean average precision (mAP). To foster future research, we make the pretrained features for the entire MovieNet dataset, along with our genre classification code and the trained models, publicly available.
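The sketch below illustrates the general idea stated in this abstract: a small transformer classifier over pretrained per-frame video and audio features, keeping all frames as tokens rather than applying temporal pooling to the inputs, with a multi-label genre output. It is not the published model; the feature dimensions, layer sizes, and genre count are assumptions for illustration.

# Illustrative sketch, not the authors' released model: transformer over pretrained
# per-frame video and audio features without input temporal pooling.
import torch
import torch.nn as nn

class TrailerGenreClassifier(nn.Module):
    def __init__(self, video_dim=768, audio_dim=512, model_dim=256,
                 num_layers=2, num_heads=4, num_genres=21):
        super().__init__()
        # Project features of different dimensionality into a shared space
        self.video_proj = nn.Linear(video_dim, model_dim)
        self.audio_proj = nn.Linear(audio_dim, model_dim)
        # Learned modality embeddings distinguish video tokens from audio tokens
        self.modality_emb = nn.Parameter(torch.zeros(2, model_dim))
        layer = nn.TransformerEncoderLayer(model_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(model_dim, num_genres)

    def forward(self, video_feats, audio_feats):
        # video_feats: (batch, T_v, video_dim); audio_feats: (batch, T_a, audio_dim)
        v = self.video_proj(video_feats) + self.modality_emb[0]
        a = self.audio_proj(audio_feats) + self.modality_emb[1]
        tokens = torch.cat([v, a], dim=1)        # all frames kept; no input pooling
        encoded = self.encoder(tokens)
        logits = self.head(encoded.mean(dim=1))  # pooled only after the transformer
        return logits                            # use with BCEWithLogitsLoss (multi-label)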
