Publications

Publications by Alípio Jorge

2024

text2story: A Python Toolkit to Extract and Visualize Story Components of Narrative Text

Authors
Amorim, E; Campos, R; Jorge, AM; Mota, P; Almeida, R;

Publication
LREC/COLING

Abstract
Story components, namely events, time, participants, and their relations, are present in narrative texts from different domains such as journalism, medicine, finance, and law. The automatic extraction of narrative elements encompasses several NLP tasks such as Named Entity Recognition, Semantic Role Labeling, Event Extraction, and Temporal Inference. The text2story Python package, an easy-to-use modular library, supports the narrative extraction and visualization pipeline. The package contains an array of narrative extraction tools that can be used separately or in sequence. With this toolkit, end users can process free text in English or Portuguese and obtain formal representations, like standard annotation files or a formal logical representation. The toolkit also enables narrative visualization as Message Sequence Charts (MSC), Knowledge Graphs, and Bubble Diagrams, making it useful to visualize and transform human-annotated narratives. The package combines the use of off-the-shelf and custom tools and is easily patched (replacing existing components) and extended (e.g. with new visualizations). It includes an experimental module for narrative element effectiveness assessment and is therefore also a valuable asset for researchers developing solutions for narrative extraction. To evaluate the baseline components, we present some results of the main annotators embedded in our package for datasets in English and Portuguese. We also compare the results with the extraction of narrative elements by GPT-3, a robust LLM.

2024

Proceedings of Text2Story - Seventh Workshop on Narrative Extraction From Texts held in conjunction with the 46th European Conference on Information Retrieval (ECIR 2024), Glasgow, Scotland, UK, March 24, 2024

Authors
Campos, R; Jorge, AM; Jatowt, A; Bhatia, S; Litvak, M;

Publication
Text2Story@ECIR

2024

DRL-KeyAgree: An Intelligent Combinatorial Deep Reinforcement Learning-Based Vehicular Platooning Secret Key Generation

Authors
Kurunathan, H; Li, K; Tovar, E; Jorge, AM; Ni, W; Jamalipour, A;

Publication
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS

Abstract
The exploitation of radio channels' inherent randomness for generating secret keys within a vehicular platoon offers a promising approach to securing communications in dynamic and unpredictable environments. Channel-based key generation leverages the fact that the physical characteristics of the radio channel, such as fading, shadowing, and multipath propagation, vary in a complex manner that makes it difficult for external adversaries to predict or replicate. A challenge lies in accurately assessing the channel's randomness to ensure the generated keys are both secure and consistent across the platooning vehicles, especially in vehicular environments with high mobility and an ever-changing urban landscape. This paper proposes a novel channel-based key generation technique (DRL-KeyAgree) to enhance communication security within vehicular platoons through combinatorial deep reinforcement learning (DRL). DRL-KeyAgree addresses key disagreement among platooning vehicles by training an advantage Actor-Critic (A2C) model, which integrates policy- and value-based strategies to dynamically select optimal quantization intervals that adapt to the random wireless channels. Further incorporation of Long Short-Term Memory (LSTM) allows DRL-KeyAgree to capture the characteristics of partially observable radio channels, significantly enhancing the key agreement rate among vehicles. DRL-KeyAgree is rigorously evaluated using the standard National Institute of Standards and Technology (NIST) test suite.
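To make the role of the quantization interval concrete, here is a minimal sketch of threshold quantization with a guard band. It is not the DRL-KeyAgree method: the function names, the sample values, the fixed threshold, and the guard width are all hypothetical, and the abstract's learned (A2C/LSTM) interval selection is replaced by two hand-picked settings.

```python
def quantize(samples, threshold, guard):
    """Map channel samples to bits; samples inside the guard band become None."""
    bits = []
    for s in samples:
        if s > threshold + guard:
            bits.append(1)
        elif s < threshold - guard:
            bits.append(0)
        else:
            bits.append(None)  # too close to the threshold, so the sample is dropped
    return bits


def agreement_rate(bits_a, bits_b):
    """Fraction of positions kept by both sides on which the bits agree."""
    kept = [(a, b) for a, b in zip(bits_a, bits_b) if a is not None and b is not None]
    if not kept:
        return 0.0
    return sum(a == b for a, b in kept) / len(kept)


# Hypothetical signal-strength readings (dBm) observed by two vehicles.
vehicle_a = [-62.0, -70.5, -66.1, -58.3, -74.0]
vehicle_b = [-61.2, -71.0, -64.8, -59.0, -73.1]

no_guard = agreement_rate(quantize(vehicle_a, -65.0, 0.0), quantize(vehicle_b, -65.0, 0.0))
with_guard = agreement_rate(quantize(vehicle_a, -65.0, 2.0), quantize(vehicle_b, -65.0, 2.0))
```

In this toy data, the third pair of readings falls on opposite sides of the threshold, so a wider guard band trades key length for agreement. Choosing that interval well is the decision the abstract assigns to the learned policy.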

2024

ACE-2005-PT: Corpus for Event Extraction in Portuguese

Authors
Cunha, LF; Silvano, P; Campos, R; Jorge, A;

Publication
PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024

Abstract
Event extraction is an NLP task that commonly involves identifying the central word (trigger) for an event and its associated arguments in text. ACE-2005 is widely recognised as the standard corpus in this field. While other corpora, like PropBank, primarily focus on annotating predicate-argument structure, ACE-2005 provides comprehensive information about the overall event structure and semantics. However, its limited language coverage restricts its usability. This paper introduces ACE-2005-PT, a corpus created by translating ACE-2005 into Portuguese, with European and Brazilian variants. To speed up the process of obtaining ACE-2005-PT, we rely on automatic translators. This, however, poses some challenges related to automatically identifying the correct alignments between multi-word annotations in the original text and in the corresponding translated sentence. To address this, we developed an alignment pipeline that incorporates several alignment techniques: lemmatization, fuzzy matching, synonym matching, multiple translations, and a BERT-based word aligner. To measure the alignment effectiveness, a subset of annotations from the ACE-2005-PT corpus was manually aligned by an expert linguist. This subset was then compared against our pipeline results, which achieved exact and relaxed match scores of 70.55% and 87.55%, respectively. As a result, we successfully generated a Portuguese version of the ACE-2005 corpus, which has been accepted for publication by LDC.
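As a rough illustration of the fuzzy-matching step named in the abstract, the sketch below locates a translated multi-word annotation inside a translated sentence. It is an assumption-laden stand-in, not the authors' pipeline: the function name `best_span`, the 0.8 threshold, and the use of Python's standard `difflib.SequenceMatcher` are all choices made here for illustration.

```python
from difflib import SequenceMatcher


def best_span(annotation, tokens, min_ratio=0.8):
    """Return (start, end, ratio) for the token window most similar to
    the annotation, or None if no window reaches min_ratio."""
    target = annotation.lower()
    n_words = len(annotation.split())
    best = None
    # Try windows one token shorter and one token longer than the annotation.
    for size in range(max(1, n_words - 1), n_words + 2):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size]).lower()
            ratio = SequenceMatcher(None, target, candidate).ratio()
            if best is None or ratio > best[2]:
                best = (start, start + size, ratio)
    if best is not None and best[2] >= min_ratio:
        return best
    return None


# Hypothetical translated sentence and annotation strings.
sentence = "O presidente dos Estados Unidos visitou Lisboa".split()
exact = best_span("presidente dos Estados Unidos", sentence)
fuzzy = best_span("presidentes dos Estados Unidos", sentence)
missing = best_span("banco central", sentence)
```

Here `exact` and `fuzzy` both resolve to tokens 1 to 5 of the sentence, while `missing` is `None`. A real pipeline would combine this with the other techniques the abstract lists before falling back to manual review.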

2024

Keywords attention for fake news detection using few positive labels

Authors
de Souza, MC; Golo, MPS; Jorge, AMG; de Amorim, ECF; Campos, RNT; Marcacini, RM; Rezende, SO;

Publication
INFORMATION SCIENCES

Abstract
Fake news detection (FND) tools are essential to increase the reliability of information in social media. FND can be approached as a machine learning classification problem so that discriminative features can be automatically extracted. However, this requires a large news set, which in turn implies a considerable amount of human experts' effort for labeling. In this paper, we explore Positive and Unlabeled Learning (PUL) to reduce the labeling cost. In particular, we improve PUL with the network-based Label Propagation (PU-LP) algorithm. PU-LP achieved competitive results in FND by exploiting relations between news and terms and using few labeled fake news items. We propose integrating an attention mechanism in PU-LP that can define which terms in the network are more relevant for detecting fake news. We use GNEE, a state-of-the-art algorithm based on graph attention networks. Our proposal outperforms state-of-the-art methods, improving F1 by 2% to 10%, especially when only 10% of the fake news is labeled. It is competitive with the binary baseline, even when nearly half of the data is labeled. Discrimination ability is also visualized through t-SNE. We also present an analysis of the limitations of our approach according to the type of text found in each dataset.

2024

Text2Story Lusa: A Dataset for Narrative Analysis in European Portuguese News Articles

Authors
Nunes, S; Jorge, AM; Amorim, E; Sousa, HO; Leal, A; Silvano, PM; Cantante, I; Campos, R;

Publication
LREC/COLING

Abstract
Narratives have been the subject of extensive research across various scientific fields such as linguistics and computer science. However, the scarcity of freely available datasets, essential for studying this genre, remains a significant obstacle. Furthermore, datasets annotated with narrative components and their morphosyntactic and semantic information are even scarcer. To address this gap, we developed the Text2Story Lusa datasets, which consist of a collection of news articles in European Portuguese. The first dataset consists of 357 news articles and the second dataset comprises a subset of 117 manually and densely annotated articles, totaling over 50 thousand individual annotations. By focusing on texts with substantial narrative elements, we aim to provide a valuable resource for studying narrative structures in European Portuguese news articles. On the one hand, the first dataset provides researchers with data to study narratives from various perspectives. On the other hand, the annotated dataset facilitates research in information extraction and related tasks, particularly in the context of narrative extraction pipelines. Both datasets are made available adhering to FAIR principles, thereby enhancing their utility within the research community.
