About

Sérgio Nunes is an Associate Professor in the Department of Informatics Engineering at FEUP, University of Porto, and a Senior Researcher at INESC TEC. He holds a PhD in Informatics Engineering (2010) in the area of Information Retrieval, with work focused on using temporal features to estimate the relevance of information. He also holds a Master's degree in Information Management (2004), with work on interoperability between academic information systems.


His main research interests are information retrieval, information interaction and visualization, and web-based information systems. His teaching focuses on databases, web technologies, and information retrieval, and he coordinates several course units across different programs, namely the Doctoral Program in Informatics Engineering, the Bachelor's and Master's in Informatics Engineering, and the Master's in Multimedia.


He was Director of the U.Porto Media Innovation Labs (MIL), the University of Porto's Competence Center that aims to develop the university's capacity in the area of Media across teaching, research, and innovation, promoting collaboration among existing structures and coordination with external partners.

Details

  • Name

    Sérgio Nunes
  • Role

    Area Manager
  • Since

    20 December 2010
Publications

2025

Zero-Shot and Hybrid Strategies for Tetun Ad-Hoc Text Retrieval

Authors
de Jesus, G; Singh, AK; Nunes, S; Yates, A;

Publication
Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR)

Abstract
Dense retrieval models are generally trained using supervised learning approaches for representation learning, which require a labeled dataset (i.e., query-document pairs). However, training such models from scratch is not feasible for most languages, particularly under-resourced ones, due to data scarcity and computational constraints. As an alternative, pretrained dense retrieval models can be fine-tuned for specific downstream tasks or applied directly in zero-shot settings. Given the lack of labeled data for Tetun and the fact that existing dense retrieval models do not currently support the language, this study investigates their application in zero-shot, out-of-distribution scenarios. We adapted these models to Tetun documents, producing zero-shot embeddings, to evaluate their performance across various document representations and retrieval strategies for the ad-hoc text retrieval task. The results show that most pretrained monolingual dense retrieval models outperformed their multilingual counterparts in various configurations. Given the lack of dense retrieval models specialized for Tetun, we combine Hiemstra LM with ColBERTv2 in a hybrid strategy, achieving a relative improvement of +2.01% in P@10, +4.24% in MAP@10, and +2.45% in NDCG@10 over the baseline, based on evaluations using 59 queries and up to 2,000 retrieved documents per query. We propose dual tuning parameters for the score fusion approach commonly used in hybrid retrieval and demonstrate that enriching document titles with summaries generated by a large language model (LLM) from the documents' content significantly enhances the performance of hybrid retrieval strategies in Tetun. To support reproducibility, we publicly release the LLM-generated document summaries and run files.

2025

Insights into LLM-Based Conversational Search: A Study of Tetun-Speaking Users' Search Behavior

Authors
Jesus, GD; Nunes, S;

Publication
Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR)

Abstract
Advancements in large language model (LLM)-based conversational assistants have transformed search experiences into more natural and context-aware dialogues that resemble human conversation. However, limited access to interaction log data hinders a deeper understanding of their real-world usage. To address this gap, we analyzed 16,952 prompt logs from 904 unique users of Labadain Chat, an LLM-based conversational assistant designed for Tetun speakers, to uncover patterns in user search behavior, engagement, and intent. Our findings show that most users (29.87%) spent between one and five minutes per session, with an average of 43 unique daily users. The majority (93.97%) submitted multiple prompts per session, with an average session duration of 16.9 minutes. Most users (95.22%) were based in Timor-Leste, with education and science (28.75%) and health (28.00%) being the most searched topics. We compared our findings with a study on Google Bard logs in English, revealing similar search characteristics - including engagement duration, command-based instructions, and requests for specific assistance. Furthermore, a comparison with two conventional search engines suggests that LLM-based conversational systems have influenced user search behavior on traditional platforms, reflecting a broader trend toward command-driven queries. These insights contribute to a deeper understanding of how user search behavior evolves, particularly within low-resource language communities. To support future research, we publicly release LabadainLog-17k+, a dataset of over 17,000 real-world user search logs in Tetun, offering a unique resource for investigating conversational search in this language.

2024

Indexing Portuguese NLP Resources with PT-Pump-Up

Authors
Almeida, R; Campos, R; Jorge, A; Nunes, S;

Publication
Proceedings of the 16th International Conference on Computational Processing of Portuguese, PROPOR 2024, Santiago de Compostela, Galicia/Spain, March 12-15, 2024, Volume 2

Abstract

2024

A Community-Driven Data-to-Text Platform for Football Match Summaries

Authors
Fernandes, P; Nunes, S; Santos, L;

Publication
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy.

Abstract
Data-to-text systems offer a transformative approach to generating textual content in data-rich environments. This paper describes the architecture and deployment of Prosebot, a community-driven data-to-text platform tailored for generating textual summaries of football matches derived from match statistics. The system enhances the visibility of lower-tier matches, traditionally accessible only through data tables. Prosebot uses a template-based Natural Language Generation (NLG) module to generate initial drafts, which are subsequently refined by the reading community. Comprehensive evaluations, encompassing both human-mediated and automated assessments, were conducted to assess the system's efficacy. Analysis of the community-edited texts reveals that significant segments of the initial automated drafts are retained, suggesting their high quality and acceptance by the collaborators. Preliminary surveys conducted among platform users highlight a predominantly positive reception within the community.

2024

Text2Story Lusa: A Dataset for Narrative Analysis in European Portuguese News Articles

Authors
Nunes, S; Jorge, AM; Amorim, E; Sousa, HO; Leal, A; Silvano, PM; Cantante, I; Campos, R;

Publication
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy.

Abstract
Narratives have been the subject of extensive research across various scientific fields such as linguistics and computer science. However, the scarcity of freely available datasets, essential for studying this genre, remains a significant obstacle. Furthermore, datasets annotated with narrative components and their morphosyntactic and semantic information are even scarcer. To address this gap, we developed the Text2Story Lusa datasets, which consist of a collection of news articles in European Portuguese. The first dataset consists of 357 news articles and the second dataset comprises a subset of 117 manually densely annotated articles, totaling over 50 thousand individual annotations. By focusing on texts with substantial narrative elements, we aim to provide a valuable resource for studying narrative structures in European Portuguese news articles. On the one hand, the first dataset provides researchers with data to study narratives from various perspectives. On the other hand, the annotated dataset facilitates research in information extraction and related tasks, particularly in the context of narrative extraction pipelines. Both datasets are made available adhering to FAIR principles, thereby enhancing their utility within the research community.

Supervised theses

2023

Text Information Retrieval in Tetun

Author
Gabriel de Jesus

Institution
UP-FEUP

2023

Visual narratives supported by dynamic infographics: a case study in the sports domain

Author
Pedro Manuel Santos Queirós

Institution
UP-FEUP

2023

Guidelines to introduce Internet voting in Portuguese elections based on the Estonian case study

Author
Marlon Vinícius Andrade de Luna Freire

Institution
UP-FEUP

2023

Building a search engine on a sports-related platform

Author
Ricardo Filipe da Silva Néri Marques Carvalho

Institution
UP-FEUP

2023

Building Portuguese Language Resources for Natural Language Processing Tasks

Author
Rúben Filipe Seabra de Almeida

Institution
UP-FEUP