INESC TEC research earns award for pioneering work on extracting events from texts written in Portuguese
The paper "Event Extraction for Portuguese: A QA-driven Approach using ACE-2005" won the Best Student Paper Award at the 22nd Portuguese Conference on Artificial Intelligence (EPIA’23). The research led to the development of an event extraction framework for the Portuguese language. The solution stands out not only for targeting Portuguese texts, but also because, in addition to identifying and classifying event triggers, it extracts the arguments associated with each event, namely participants and attributes.
29th September 2023
"There is currently a vast amount of data. However, a significant part of this information is in text, making its automatic processing complex. The Information Extraction domain seeks to address this challenge by developing various techniques for extracting information from texts, towards generating structured data - since one of the main tasks is the extraction of events to identify and classify those that occur in texts", explained Luís Filipe Cunha.
According to the INESC TEC researcher, this is a technique with great potential for application in different areas, and it may benefit, for instance, the development of Knowledge Base Graphs, Natural Language Understanding, summarisation, or recommendation systems. However, and although there are already several event extraction systems in English, “they show limited portability to other languages due to their dependence on annotated textual resources in said language”. The main goal of this research work was the development of a solution for the extraction of events in Portuguese.
"The extraction of events for the Portuguese language is an under-explored area. Most of the works that resemble ours are limited to the detection of events, i.e., identification and classification of event triggers. However, our work not only focuses on the extraction of triggers, but also on the arguments associated with the event: participants and attributes", mentioned Luís Filipe Cunha. More specifically, the work proposes a new method, which involves two steps: on the one hand, the classification and identification of the key word of an event, i.e., the trigger, and, on the other hand, the extraction of event arguments using a Q&A – Question Answering - model.
"The method includes the fine-tuning of the BERTimbau language model - a BERT model based on the Transformers architecture introduced by Google, in 2017. This model was previously pre-trained with many Portuguese texts, allowing it to acquire knowledge about the vocabulary and language used in said texts. Our work focused on exploring the model's knowledge, adapting it (fine-tuning) for the task of extracting events in the Portuguese language. In practical terms, we adjusted the model parameters using event annotation data included in the corpus ACE-2005 - the reference in event extraction -, previously annotated manually by the Linguistic Data Consortium”.
According to the researcher, since the first version of the ACE-2005 corpus for the Portuguese language was produced within the scope of this work, the team was the first to use this dataset to train event extraction models for the language. "Also, to our knowledge, this work was the first to use Q&A models in the extraction of events in the Portuguese language," he added.
The solution is part of the PhD work of Luís Filipe Cunha, a student at the Faculty of Sciences of the University of Porto (FCUP), supervised by Alípio Jorge and Ricardo Campos, researchers at INESC TEC and professors at FCUP and the University of Beira Interior, respectively. It is also part of two projects funded by the Foundation for Science and Technology (FCT), Text2Story and StorySense. According to the researcher, the award and the publication of the paper in Springer's Lecture Notes in Artificial Intelligence (LNAI) validate the work carried out in developing Natural Language Processing models focused on the Portuguese language. This work contributes to reducing dependence on the English language and to increasing the resources that may serve as the basis for other applications in the field of AI and natural language processing.
The models are available on the Hugging Face Hub. As for the future, Luís Filipe Cunha mentioned that the goal is to explore new datasets to cover a greater scope and diversity of event types, while improving the models with other neural network architectures, such as Graph Neural Networks.
The paper was awarded at EPIA’23, which took place between September 5 and 8, in Faial (Azores). The conference is organised by the Portuguese Association for Artificial Intelligence (APPIA).