Details
Name
Gabriel Jesus
Role
Affiliated Researcher
Since
29th September 2021
Nationality
Timor-Leste
Centre
Human-Centered Computing and Information Science
Contacts
+351222094000
gabriel.jesus@inesctec.pt
2026
Authors
Araújo, A; de Jesus, G; Nunes, S;
Publication
Lecture Notes in Computer Science
Abstract
Developing information retrieval (IR) systems that enable access across multiple languages is crucial in multilingual contexts. In Timor-Leste, where Tetun, Portuguese, English, and Indonesian are official and working languages, no cross-lingual information retrieval (CLIR) solutions currently exist to support information access across these languages. This study addresses that gap by investigating CLIR approaches tailored to the linguistic landscape of Timor-Leste. Leveraging an existing monolingual Tetun document collection and ad-hoc text retrieval baselines, we explore the feasibility of CLIR for Tetun. Queries were manually translated into Portuguese, English, and Indonesian to create a multilingual query set. These were then automatically translated back into Tetun using Google Translate and several large language models, and used to retrieve documents in Tetun. Results show that Google Translate is the most reliable tool for Tetun CLIR overall, and the Hiemstra LM consistently outperforms BM25 and DFR BM25 in cross-lingual retrieval performance. However, overall effectiveness remains up to 26.95 percentage points lower than that of the monolingual baseline, underscoring the limitations of current translation tools and the challenges of developing an effective CLIR system for Tetun. Despite these challenges, this work establishes the first CLIR baseline for Tetun ad-hoc text retrieval, providing a foundation for future research in this under-resourced setting. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
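As an illustration of the evaluation setup this abstract describes, the sketch below runs machine-translated Tetun queries against a Tetun index and compares BM25, DFR BM25, and Hiemstra LM with standard ad-hoc metrics using PyTerrier. The index path, query file, and qrels file are hypothetical placeholders, not the paper's actual artifacts, and the translation step is assumed to have happened beforehand.

# Illustrative CLIR evaluation sketch: queries already machine-translated into
# Tetun are run against a Tetun index with three classical weighting models.
# File paths below are placeholders, not the paper's released resources.
import pandas as pd
import pyterrier as pt

pt.init()

index = pt.IndexFactory.of("./tetun_index")  # hypothetical pre-built Tetun index

# Translated queries, one per row; PyTerrier expects columns 'qid' and 'query'.
topics = pd.read_csv("queries_translated_to_tetun.csv")
qrels = pt.io.read_qrels("tetun_qrels.txt")  # hypothetical relevance judgments

models = {
    "BM25": pt.BatchRetrieve(index, wmodel="BM25"),
    "DFR_BM25": pt.BatchRetrieve(index, wmodel="DFR_BM25"),
    "Hiemstra_LM": pt.BatchRetrieve(index, wmodel="Hiemstra_LM"),
}

# Compare the three retrieval models on the translated query set.
results = pt.Experiment(
    list(models.values()),
    topics,
    qrels,
    eval_metrics=["P_10", "map", "ndcg_cut_10"],
    names=list(models.keys()),
)
print(results)

Running the same experiment with the original (untranslated) Tetun queries would give the monolingual baseline against which the cross-lingual runs are compared.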
2025
Authors
Jesus, G; Singh, SAK; Nunes, S; Yates, A;
Publication
Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval, ICTIR 2025
Abstract
Dense retrieval models are generally trained using supervised learning approaches for representation learning, which require a labeled dataset (i.e., query-document pairs). However, training such models from scratch is not feasible for most languages, particularly under-resourced ones, due to data scarcity and computational constraints. As an alternative, pretrained dense retrieval models can be fine-tuned for specific downstream tasks or applied directly in zero-shot settings. Given the lack of labeled data for Tetun and the fact that existing dense retrieval models do not currently support the language, this study investigates their application in zero-shot, out-of-distribution scenarios. We adapted these models to Tetun documents, producing zero-shot embeddings, to evaluate their performance across various document representations and retrieval strategies for the ad-hoc text retrieval task. The results show that most pretrained monolingual dense retrieval models outperformed their multilingual counterparts in various configurations. Given the lack of dense retrieval models specialized for Tetun, we combine Hiemstra LM with ColBERTv2 in a hybrid strategy, achieving a relative improvement of +2.01% in P@10, +4.24% in MAP@10, and +2.45% in NDCG@10 over the baseline, based on evaluations using 59 queries and up to 2,000 retrieved documents per query. We propose dual tuning parameters for the score fusion approach commonly used in hybrid retrieval and demonstrate that enriching document titles with summaries generated by a large language model (LLM) from the documents' content significantly enhances the performance of hybrid retrieval strategies in Tetun. To support reproducibility, we publicly release the LLM-generated document summaries and run files.
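The "dual tuning parameters" mentioned above can be read as two independent weights applied to the normalized scores of the lexical and dense rankers before fusion. The sketch below is one plausible formulation under that reading; the min-max normalization, the parameter names alpha and beta, and the example weight values are assumptions for illustration, not the paper's exact method.

# Illustrative score-fusion sketch for a hybrid Hiemstra LM + ColBERTv2 setup:
# each ranker's per-query scores are min-max normalized, then combined with two
# separate weights (the "dual tuning parameters").
from collections import defaultdict


def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale one query's scores to [0, 1] so the two rankers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(lexical: dict[str, float], dense: dict[str, float],
         alpha: float = 0.6, beta: float = 0.4) -> list[tuple[str, float]]:
    """Combine normalized lexical and dense scores with two independent weights."""
    lexical, dense = min_max_normalize(lexical), min_max_normalize(dense)
    fused = defaultdict(float)
    for doc, score in lexical.items():
        fused[doc] += alpha * score
    for doc, score in dense.items():
        fused[doc] += beta * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Toy per-query runs (doc id -> score); real runs would come from the two rankers.
hiemstra_run = {"d1": 12.3, "d2": 10.1, "d3": 8.7}
colbert_run = {"d2": 0.92, "d3": 0.88, "d4": 0.75}
print(fuse(hiemstra_run, colbert_run))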
2025
Authors
de Jesus, G; Nunes, S;
Publication
Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval, ICTIR 2025
Abstract
Advancements in large language model (LLM)-based conversational assistants have transformed search experiences into more natural and context-aware dialogues that resemble human conversation. However, limited access to interaction log data hinders a deeper understanding of their real-world usage. To address this gap, we analyzed 16,952 prompt logs from 904 unique users of Labadain Chat, an LLM-based conversational assistant designed for Tetun speakers, to uncover patterns in user search behavior, engagement, and intent. Our findings show that most users (29.87%) spent between one and five minutes per session, with an average of 43 unique daily users. The majority (93.97%) submitted multiple prompts per session, with an average session duration of 16.9 minutes. Most users (95.22%) were based in Timor-Leste, with education and science (28.75%) and health (28.00%) being the most searched topics. We compared our findings with a study on Google Bard logs in English, revealing similar search characteristics, including engagement duration, command-based instructions, and requests for specific assistance. Furthermore, a comparison with two conventional search engines suggests that LLM-based conversational systems have influenced user search behavior on traditional platforms, reflecting a broader trend toward command-driven queries. These insights contribute to a deeper understanding of how user search behavior evolves, particularly within low-resource language communities. To support future research, we publicly release LabadainLog-17k+, a dataset of over 17,000 real-world user search logs in Tetun, offering a unique resource for investigating conversational search in this language.
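A minimal sketch of the kind of log analysis the abstract reports: deriving sessions from timestamped prompt logs and computing session duration and unique daily users. The column names, file name, and the 30-minute inactivity cutoff used to split sessions are assumptions, not the paper's stated procedure.

# Illustrative log-analysis sketch over timestamped prompt logs.
import pandas as pd

# Hypothetical log file with columns: user_id, timestamp, prompt.
logs = pd.read_csv("labadain_prompt_logs.csv", parse_dates=["timestamp"])
logs = logs.sort_values(["user_id", "timestamp"])

# Start a new session whenever a user is inactive for more than 30 minutes
# (cutoff is an assumption); cumulative sum of session starts yields session ids.
gap = logs.groupby("user_id")["timestamp"].diff()
logs["session_id"] = (gap.isna() | (gap > pd.Timedelta(minutes=30))).cumsum()

# Per-session duration in minutes.
sessions = logs.groupby("session_id")["timestamp"].agg(["min", "max"])
sessions["duration_min"] = (sessions["max"] - sessions["min"]).dt.total_seconds() / 60

# Unique users per calendar day.
daily_users = logs.groupby(logs["timestamp"].dt.date)["user_id"].nunique()

print(sessions["duration_min"].describe())
print("average unique daily users:", daily_users.mean())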
2024
Authors
de Jesus, G; Nunes, S;
Publication
CEUR Workshop Proceedings
Abstract
The Cranfield paradigm has served as a foundational approach for developing test collections, with relevance judgments typically conducted by human assessors. However, the emergence of large language models (LLMs) has introduced new possibilities for automating these tasks. This paper explores the feasibility of using LLMs to automate relevance assessments, particularly within the context of low-resource languages. In our study, LLMs are employed to automate relevance judgment tasks by providing a series of query-document pairs in Tetun as the input text. The models are tasked with assigning relevance scores to each pair, and these scores are then compared to those from human annotators to evaluate the inter-annotator agreement levels. Our investigation reveals results that align closely with those reported in studies of high-resource languages. © 2024 CEUR-WS. All rights reserved.
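To make the procedure concrete, the sketch below scores query-document pairs with an LLM on a graded scale and measures agreement with human labels using Cohen's kappa. The model name, prompt wording, 0-2 relevance scale, and toy Tetun examples are illustrative assumptions, not the paper's exact setup.

# Illustrative LLM-based relevance judgment sketch with agreement measurement.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "You are a relevance assessor. Given a query and a document in Tetun, "
    "answer with a single digit: 0 (not relevant), 1 (partially relevant), "
    "or 2 (relevant).\n\nQuery: {query}\n\nDocument: {document}\n\nScore:"
)


def llm_judge(query: str, document: str) -> int:
    """Ask the LLM for a graded relevance score for one query-document pair."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is an assumption
        messages=[{"role": "user", "content": PROMPT.format(query=query, document=document)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip()[0])


# Toy example: pairs would come from the Tetun test collection and human_labels
# from the original assessors; agreement is then measured with Cohen's kappa.
pairs = [("kuda kafe iha Timor", "Dokumentu kona-ba produsaun kafe ..."),
         ("edukasaun", "Notisia kona-ba desportu ...")]
human_labels = [2, 0]
llm_labels = [llm_judge(q, d) for q, d in pairs]
print("Cohen's kappa:", cohen_kappa_score(human_labels, llm_labels))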
2024
Authors
de Jesus, G; Nunes, S;
Publication
CoRR