2023
Authors
Ponte, L; Koch, I; Lopes, CT;
Publication
LEVERAGING GENERATIVE INTELLIGENCE IN DIGITAL LIBRARIES: TOWARDS HUMAN-MACHINE COLLABORATION, ICADL 2023, PT II
Abstract
An institution must understand its users to provide quality services, and archives are no exception. Over the years, archives have adapted to the technological world, and their users have also changed. To understand archive users' characteristics and motivations, we conducted a study in the context of the Portuguese Archives. For this purpose, we analysed a survey and complemented this analysis with information gathered in interviews with archivists. Based on the most frequent reasons for visiting the archives, we defined six main archival profiles (genealogical research, historical research, legal purposes, academic work, institutional purposes and publication purposes), later characterised using the results of the previous analysis. For each profile, we created a persona for a more visual and realistic representation of users.
2023
Authors
Alonso, O; Cousijn, H; Silvello, G; Marrero, M; Lopes, CT; Marchesin, S;
Publication
TPDL
Abstract
2023
Authors
Dias, M; Lopes, CT;
Publication
ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE
Abstract
Linked data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are non-machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This article evaluates the impact of image processing methods and parameter tuning in OCR applied to typewritten cultural heritage documents. The approach uses a multi-objective problem formulation to minimize Levenshtein edit distance and maximize the number of words correctly identified with a non-dominated sorting genetic algorithm (NSGA-II) to tune the methods' parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theater plays' covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
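The two objectives of the multi-objective formulation described above (minimizing Levenshtein edit distance, maximizing correctly identified words) can be sketched in plain Python. This is a minimal illustration of the evaluation metrics only, not the paper's actual implementation; the function names are illustrative.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings:
    # the minimum number of insertions, deletions, and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def words_correct(ocr_text: str, ground_truth: str) -> int:
    # Count ground-truth words reproduced exactly in the OCR output,
    # respecting multiplicities via bag (multiset) intersection.
    ocr_counts = Counter(ocr_text.split())
    return sum(min(n, ocr_counts[w])
               for w, n in Counter(ground_truth.split()).items())
```

In an NSGA-II setting such as the one the abstract describes, each candidate parameterization of the pre-processing pipeline would be scored by these two objectives against a ground-truth transcription, with the algorithm maintaining a Pareto front of non-dominated parameter sets.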
2023
Authors
Koch, I; Lopes, CT; Ribeiro, C;
Publication
ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE
Abstract
Archives are facing numerous challenges. On the one hand, archival assets are evolving to encompass digitized documents and increasing quantities of born-digital information in diverse formats. On the other hand, the audience is changing along with how it wishes to access archival material. Moreover, the interoperability requirements of cultural heritage repositories are growing. In this context, the Portuguese Archives started an ambitious program aiming to evolve its data model, migrate existing records, and build a new archival management system appropriate to both archival tasks and public access. The overall goal is to have a fine-grained and flexible description, more machine-actionable than the current one. This work describes ArchOnto, a linked open data model for archives, and rules for its automatic population from existing records. ArchOnto adopts a semantic web approach and encompasses the CIDOC Conceptual Reference Model and additional ontologies, envisioning interoperability with datasets curated by multiple communities of practice. Existing ISAD(G)-conforming descriptions are being migrated to the new model using the direct mappings provided here. We used a sample of 25 records associated with different description levels to validate the completeness and conformity of ArchOnto to existing data. This work is in progress and is original in several respects: (1) it is one of the first approaches to use CIDOC CRM in the context of archives, identifying problems and questions that emerged during the process and pinpointing possible solutions; (2) it addresses the balance in the model between the migration of existing records and the construction of new ones by archive professionals; and (3) it adopts an open world view on linking archival data to global information sources.
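As a rough illustration of the kind of record-to-linked-data mapping the abstract describes, the sketch below turns a flat archival record into CIDOC CRM-style triples. The record values, URIs, and the choice of properties here are hypothetical examples for exposition, not ArchOnto's actual mapping rules.

```python
def to_triples(rec: dict) -> list:
    # Map a flat record (hypothetical field names) to subject-predicate-object
    # triples using CIDOC CRM classes and properties. The date is routed
    # through an E65 Creation event, since P4 has time-span applies to
    # temporal entities rather than to documents directly.
    doc = rec["reference_code"]
    creation = doc + "/creation"  # hypothetical event identifier
    return [
        (doc, "rdf:type", "crm:E31_Document"),
        (doc, "crm:P102_has_title", rec["title"]),
        (doc, "crm:P94i_was_created_by", creation),
        (creation, "rdf:type", "crm:E65_Creation"),
        (creation, "crm:P4_has_time-span", rec["date"]),
    ]


record = {"title": "Baptism register, parish of Santa Maria",  # invented example
          "reference_code": "PT/EXAMPLE/0001",
          "date": "1750"}
triples = to_triples(record)
```

Representations of this shape are what make the data machine-actionable: any triple store or SPARQL endpoint can then link the document to external vocabularies and datasets.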
2023
Authors
Koch, I; Pires, C; Lopes, CT; Ribeiro, C; Nunes, S;
Publication
LINKING THEORY AND PRACTICE OF DIGITAL LIBRARIES, TPDL 2023
Abstract
Archives preserve materials that allow us to understand and interpret the past and think about the future. With the evolution of the information society, archives must take advantage of technological innovations and adapt to changes in the kind and volume of the information created. Semantic Web representations are appropriate for structuring archival data and linking them to external sources, allowing versatile access by multiple applications. ArchOnto is a new Linked Data Model based on CIDOC CRM to describe archival objects. ArchOnto combines specific aspects of archiving with the CIDOC CRM standard. In this work, we analyze the ArchOnto representation of a set of archival records from the Portuguese National Archives and compare it to their CIDOC CRM representation. As a result of ArchOnto's representation, we observe an increase in the number of classes used, from 20 in CIDOC CRM to 28 in ArchOnto, and in the number of properties, from 25 in CIDOC CRM to 28 in ArchOnto. This growth stems from the refinement of object types and their relationships, favouring the use of controlled vocabularies. ArchOnto provides higher readability for the information in archival records, keeping it in line with current standards.
2024
Authors
Moas, PM; Lopes, CT;
Publication
ACM COMPUTING SURVEYS
Abstract
Wikipedia is the world's largest online encyclopedia, but maintaining article quality through collaboration is challenging. Wikipedia defined a quality scale, but because the assessment process is manual, many articles remain unassessed. We review existing methods for automatically measuring the quality of Wikipedia articles, identifying and comparing machine learning algorithms, article features, quality metrics, and datasets across 149 distinct studies, and exploring commonalities and gaps among them. The literature is extensive, and the approaches follow past technological trends. However, machine learning is still not widely used by Wikipedia, and we hope that our analysis helps future researchers change that reality.