Publications

2025

KEIGO: Co-designing Log-Structured Merge Key-Value Stores with a Non-Volatile, Concurrency-aware Storage Hierarchy

Authors
Adao, R; Wu, ZJ; Zhou, CJ; Balmau, O; Paulo, J; Macedo, R;

Publication
PROCEEDINGS OF THE VLDB ENDOWMENT

Abstract
We present Keigo, a concurrency-and workload-aware storage middleware that enhances the performance of log-structured merge key-value stores (LSM KVS) when they are deployed on a hierarchy of storage devices. The key observation behind Keigo is that there is no one-size-fits-all placement of data across the storage hierarchy that optimizes for all workloads. Hence, to leverage the benefits of combining different storage devices, Keigo places files across different devices based on their parallelism, I/O bandwidth, and capacity. We introduce three techniques-concurrency-aware data placement, persistent read-only caching, and context-based I/O differentiation. Keigo is portable across different LSMs, is adaptable to dynamic workloads, and does not require extensive profiling. Our system enables established production KVS such as RocksDB, LevelDB, and Speedb to benefit from heterogeneous storage setups. We evaluate Keigo using synthetic and realistic workloads, showing that it improves the throughput of production-grade LSMs up to 4x for write-and 18x for read-heavy workloads when compared to general-purpose storage systems and specialized LSM KVS.

CloseRead Abstract

2025

Histogram approaches for imbalanced data streams regression

Authors
Aminian, E; Ribeiro, RP; Gama, J;

Publication
MACHINE LEARNING

Abstract
Imbalanced domains pose a significant challenge in real-world predictive analytics, particularly in the context of regression. While existing research has primarily focused on batch learning from static datasets, limited attention has been given to imbalanced regression in online learning scenarios. Intending to address this gap, in prior work, we proposed sampling strategies based on Chebyshev's inequality as the first methodologies designed explicitly for data streams. However, these approaches operated under the restrictive assumption that rare instances exclusively reside at distribution extremes. This study introduces histogram-based sampling strategies to overcome this constraint, proposing flexible solutions for imbalanced regression in evolving data streams. The proposed techniques - Histogram-based Undersampling (HistUS) and Histogram-based Oversampling (HistOS) - employ incremental online histograms to dynamically detect and prioritize rare instances across arbitrary regions of the target distribution to improve predictions in the rare cases. Comprehensive experiments on synthetic and real-world benchmarks demonstrate that HistUS and HistOS substantially improve rare-case prediction accuracy, outperforming baseline models while maintaining competitiveness with Chebyshev-based approaches.

CloseRead Abstract

2025

Large Language Models and Intelligent Agents in Education

Authors
Brito, WAT; Paulino, A; Mendes, M; Reis, A;

Publication
TECHNOLOGY AND INNOVATION IN LEARNING, TEACHING AND EDUCATION, TECH-EDU 2024, PT I

Abstract
This study examines the potential applications of large language models (LLMs) and intelligent agents in educational environments, with a particular focus on their role in enhancing the quality of teaching and learning processes. It provides a comprehensive overview of LLMs, emphasizing their capabilities in natural language analysis and generation. Furthermore, the study examines the potential for collaboration between LLMs and intelligent agents. While LLMs offer a foundation for AI capabilities, intelligent agents utilize these technologies to perform autonomous and context-aware actions within educational systems. A comparative analysis of various intelligent agent platforms, including Autogen, Langra, Crew AI, LM Studio, and Olama, constitutes a central component of this research. This study addresses the criteria that informed the selection of Crew AI for a case study, with a particular focus on its adaptability, ease of integration, and task execution capabilities in comparison to the other platforms. The research includes an analysis of the platform's performance in a controlled educational environment, highlighting the advantages of Crew AI in system functionality. These results demonstrate the necessity for a strategic and well-structured approach to integrating LLMs and intelligent agents, as their successful implementation can foster new competencies, enhance stakeholder engagement, and offer innovative teaching and learning experiences.

CloseRead Abstract

2025

Specifying Distributed Hash Tables with Allen Temporal Logic

Authors
Policarpo, N; Santos, JF; Cunha, A; Leitao, J; Costa, PA;

Publication
2025 IEEE/ACM 13TH INTERNATIONAL CONFERENCE ON FORMAL METHODS IN SOFTWARE ENGINEERING, FORMALISE

Abstract
Distributed Hash Tables (DHTs) remain to this day a central component of modern peer-to-peer (P2P) systems, which rely on complex DHT protocols to scale to millions of nodes. The correct operation of DHTs is therefore essential for the proper functioning of these systems. For this reason, formal methods have been employed to model and verify a range of correctness properties of various DHT protocols. However, these verification efforts have focused on protocol-specific properties, such as topological invariants, instead of functional properties. This focus limits our understanding of the precise guarantees offered by each protocol. We propose a protocol-independent axiomatization of DHT properties using Allen Temporal Logic (ATL). To validate our axiomatization, we have implemented it in the Alloy analyser and used our implementation both to establish a number of DHT-derived properties and to generate a set of DHT execution traces that cover an exhaustive list of DHT corner case behaviours.

CloseRead Abstract

2025

PolyNarrative: A Multilingual, Multilabel, Multi-domain Dataset for Narrative Extraction from News Articles

Authors
Nikolaidis, N; Stefanovitch, N; Silvano, P; Dimitrov, D; Yangarber, R; Guimaraes, N; Sartori, E; Androutsopoulos, I; Nakov, P; Da San Martino, G; Piskorski, J;

Publication
PROCEEDINGS OF THE 63RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS

Abstract
We present PolyNarrative, a new multilingual dataset of news articles, annotated for narratives. Narratives are overt or implicit claims, recurring across articles and languages, promoting a specific interpretation or viewpoint on an ongoing topic, often propagating mis/disinformation. We developed two-level taxonomies with coarse- and fine-grained narrative labels for two domains: (i) climate change and (ii) the military conflict between Ukraine and Russia. We collected news articles in four languages (Bulgarian, English, Portuguese, and Russian) related to the two domains and manually annotated them at the paragraph level. We make the dataset publicly available, along with experimental results of several strong baselines that assign narrative labels to news articles at the paragraph or the document level. We believe that this dataset will foster research in narrative detection and enable new research directions towards more multi-domain and highly granular narrative related tasks.

CloseRead Abstract

2025

A Reinforcement Learning Based Recommender System Framework for Web Apps: Radio and Game Aggregators Scenarios

Authors
Batista, A; Torres, JM; Sobral, P; Moreira, RS; Soares, C; Pereira, I;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2024, PT I

Abstract
Recommendation systems can play an important role in today's digital content platforms by supporting the suggestion of relevant content in a personalised manner for each customer. Such content customisation has not been consistent across most media domains, and particularly on radio streaming and gaming aggregators, which are the two real-world application domains focused in this work. The challenges faced in these application areas are the dynamic nature of user preferences and the difficulty of generating recommendations for less popular content, due to the overwhelming choice and polarisation of available top content. We present the design and implementation of a Reinforcement Learning-based Recommendation System (RLRS) for web applications, using a Deep Deterministic Policy Gradient (DDPG) agent and, as a reward function, a weighted sum of the user Click Distribution (CD) across the recommended items and the Dwell Time (DT), a measure of the time users spend interacting with those items. Our system has been deployed in real production scenarios with preliminary but promising results. Several metrics are used to track the effectiveness of our approach, such as content coverage, category diversity, and intra-list similarity. In both scenarios tested, the system shows consistent improvement and adaptability over time, reinforcing its applicability.

CloseRead Abstract

252
4495