2023
Authors
Sousa, H; Guimaraes, N; Jorge, A; Campos, R;
Publication
2023 IEEE INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE AND INTELLIGENT AGENT TECHNOLOGY, WI-IAT
Abstract
The importance of systems that can extract structured information from textual data becomes increasingly pronounced given the ever-increasing volume of text produced on a daily basis. Having a system that can effectively extract such information in an interoperable manner would be an asset for several domains, be it finance, health, or legal. Recent developments in natural language processing led to the production of powerful language models that can, to some degree, mimic human intelligence. Such effectiveness raises a pertinent question: Can these models be leveraged for the extraction of structured information? In this work, we address this question by evaluating the capabilities of two state-of-the-art language models - GPT-3 and GPT-3.5, commonly known as ChatGPT - in the extraction of narrative entities, namely events, participants, and temporal expressions. This study is conducted on the Text2Story Lusa dataset, a collection of 119 Portuguese news articles whose annotation framework includes a set of entity structures along with several tags and attribute values. We first select the best prompt template through an ablation study over prompt components that provide varying degrees of information on a subset of documents of the dataset. Subsequently, we use the best templates to evaluate the effectiveness of the models on the remaining documents. The results obtained indicate that GPT models are competitive with out-of-the-box baseline systems, presenting an all-in-one alternative for practitioners with limited resources. By studying the strengths and limitations of these models in the context of information extraction, we offer insights that can guide future improvements and avenues to explore in this field.
2023
Authors
Vinagre, J; Ghossein, MA; Peska, L; Jorge, AM; Bifet, A;
Publication
ORSUM@RecSys
Abstract
2023
Authors
Muhammad, SH; Abdulmumin, I; Ayele, AA; Ousidhoum, N; Adelani, DI; Yimam, SM; Ahmad, IS; Beloucif, M; Mohammad, SM; Ruder, S; Hourrane, O; Jorge, A; Brazdil, P; António Ali, FDM; David, D; Osei, S; Bello, BS; Lawan, FI; Gwadabe, T; Rutunda, S; Belay, TD; Messelle, WB; Balcha, HB; Chala, SA; Gebremichael, HT; Opoku, B; Arthur, S;
Publication
EMNLP
Abstract
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. These include 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families. The tweets were annotated by native speakers and used in the AfriSenti-SemEval shared task 1. We describe the data collection methodology, annotation process, and the challenges we dealt with when curating each dataset. We further report baseline experiments conducted on the different datasets and discuss their usefulness.
2023
Authors
Rabaev, I; Litvak, M; Younkin, V; Campos, R; Jorge, AM; Jatowt, A;
Publication
IACT@SIGIR
Abstract
This paper describes the shared task on Automatic Classification of Literary Epochs (CoLiE) held as a part of the 1st International Workshop on Implicit Author Characterization from Texts for Search and Retrieval (IACT’23) held at SIGIR 2023. The competition aimed to enhance the capabilities of large-scale analysis and cross-comparative studies of literary texts by automating their classification into the respective epochs. We believe that the competition contributed to the field of information retrieval by exposing the first large benchmark dataset and the first study’s results with various methods applied to this dataset. This paper presents the details of the contest, the dataset used, the evaluation procedure, and an overview of participating methods.
2023
Authors
Sousa, H; Campos, R; Jorge, A;
Publication
PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023
Abstract
Temporal expression identification is crucial for understanding texts written in natural language. Although highly effective systems such as HeidelTime exist, their limited runtime performance hampers adoption in large-scale applications and production environments. In this paper, we introduce the TEI2GO models, matching HeidelTime's effectiveness but with significantly improved runtime, supporting six languages, and achieving state-of-the-art results in four of them. To train the TEI2GO models, we used a combination of manually annotated reference corpus and developed Professor HeidelTime, a comprehensive weakly labeled corpus of news texts annotated with HeidelTime. This corpus comprises a total of 138, 069 documents (over six languages) with 1, 050, 921 temporal expressions, the largest open-source annotated dataset for temporal expression identification to date. By describing how the models were produced, we aim to encourage the research community to further explore, refine, and extend the set of models to additional languages and domains. Code, annotations, and models are openly available for community exploration and use. The models are conveniently on HuggingFace for seamless integration and application.
2023
Authors
Litvak, M; Rabaev, I; Campos, R; Jorge, AM; Jatowt, A;
Publication
IACT@SIGIR
Abstract
The access to the final selection minute is only available to applicants.
Please check the confirmation e-mail of your application to obtain the access code.