2023
Authors
Campos, R; Jorge, AM; Jatowt, A; Bhatia, S; Litvak, M; Cordeiro, JP; Rocha, C; Sousa, H; Mansouri, B;
Publication
SIGIR Forum
Abstract
2023
Authors
Sousa, H; Guimaraes, N; Jorge, A; Campos, R;
Publication
2023 IEEE INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE AND INTELLIGENT AGENT TECHNOLOGY, WI-IAT
Abstract
The importance of systems that can extract structured information from textual data becomes increasingly pronounced given the ever-increasing volume of text produced on a daily basis. Having a system that can effectively extract such information in an interoperable manner would be an asset for several domains, be it finance, health, or legal. Recent developments in natural language processing led to the production of powerful language models that can, to some degree, mimic human intelligence. Such effectiveness raises a pertinent question: Can these models be leveraged for the extraction of structured information? In this work, we address this question by evaluating the capabilities of two state-of-the-art language models - GPT-3 and GPT-3.5, commonly known as ChatGPT - in the extraction of narrative entities, namely events, participants, and temporal expressions. This study is conducted on the Text2Story Lusa dataset, a collection of 119 Portuguese news articles whose annotation framework includes a set of entity structures along with several tags and attribute values. We first select the best prompt template through an ablation study over prompt components that provide varying degrees of information on a subset of documents of the dataset. Subsequently, we use the best templates to evaluate the effectiveness of the models on the remaining documents. The results obtained indicate that GPT models are competitive with out-of-the-box baseline systems, presenting an all-in-one alternative for practitioners with limited resources. By studying the strengths and limitations of these models in the context of information extraction, we offer insights that can guide future improvements and avenues to explore in this field.
2024
Authors
Ströhle, T; Campos, R; Jatowt, A;
Publication
INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS
Abstract
In our data-flooded age, an enormous amount of redundant, but also disparate textual data is collected on a daily basis on a wide variety of topics. Much of this information refers to documents related to the same theme, that is, different versions of the same document, or different documents discussing the same topic. Being aware of such differences turns out to be an important aspect for those who want to perform a comparative task. However, as documents increase in size and volume, keeping up-to-date, detecting, and summarizing relevant changes between different documents or versions of it becomes unfeasible. This motivates the rise of the contrastive or comparative summarization task, which attempts to summarize the text of different documents related to the same topic in a way that highlights the relevant differences between them. Our research aims to provide a systematic literature review on contrastive or comparative summarization, highlighting the different methods, data sets, metrics, and applications. Overall, we found that contrastive summarization is most commonly used in controversial news articles, controversial opinions or sentiments on a topic, and reviews of a product. Despite the great interest in the topic, we note that standard data sets, as well as a competitive task dedicated to this topic, are yet to come to be proposed, eventually impeding the emergence of new methods. Moreover, the great breakthrough of using deep learning-based language models for abstract summaries in contrastive summarization is still missing.
2023
Authors
Eder, L; Campos, R; Jatowt, A;
Publication
PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023
Abstract
Versioned documents are common in many situations and play a vital part in numerous applications enabling an overview of the revisions made to a document or document collection. However, as documents increase in size, it gets difficult to summarize and comprehend all the changes made to versioned documents. In this paper, we propose a novel research problem of contrastive keyword extraction from versioned documents, and introduce an unsupervised approach that extracts keywords to reflect the key changes made to an earlier document version. In order to provide an easy-to-use comparison and summarization tool, an open-source demonstration is made available which can be found at https://contrastive-keyword-extraction.streamlit.app/.
2023
Authors
Litvak, M; Rabaev, I; Campos, R; Jorge, M; Jatowt, A;
Publication
CEUR Workshop Proceedings
Abstract
[No abstract available]
2023
Authors
Rabaev, I; Litvak, M; Younkin, V; Campos, R; Jorge, AM; Jatowt, A;
Publication
Proceedings of the IACT - The 1st International Workshop on Implicit Author Characterization from Texts for Search and Retrieval held in conjunction with the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), Taipei, Taiwan, July 27, 2023.
Abstract
This paper describes the shared task on Automatic Classification of Literary Epochs (CoLiE) held as a part of the 1st International Workshop on Implicit Author Characterization from Texts for Search and Retrieval (IACT’23) held at SIGIR 2023. The competition aimed to enhance the capabilities of large-scale analysis and cross-comparative studies of literary texts by automating their classification into the respective epochs. We believe that the competition contributed to the field of information retrieval by exposing the first large benchmark dataset and the first study’s results with various methods applied to this dataset. This paper presents the details of the contest, the dataset used, the evaluation procedure, and an overview of participating methods. © 2022 Copyright for this paper by its authors.
The access to the final selection minute is only available to applicants.
Please check the confirmation e-mail of your application to obtain the access code.