HPC system and job monitoring with LLview
Date: December 7, 2022 | 4 p.m. (UTC)
Speakers: Vitor Silva and Filipe Guimarães, Jülich Supercomputing Centre
Moderator: Esteban Mocskos, Universidad de Buenos Aires
LLview is a monitoring infrastructure developed by the Jülich Supercomputing Centre to provide an easy-to-use and adaptable software suite for monitoring High Performance Computing systems. With the emergence of large heterogeneous machines in the Exascale range, the challenges of monitoring such huge systems increase significantly. To address them, LLview is under continuous development so that it works across a wide range of hardware systems and software interfaces with negligible overhead, while providing fast, reliable access to job reports, system-wide monitoring data, and real-time system information. That information is provided to system users, project advisors, support teams, and system administrators. It helps with managing jobs and identifying performance issues at many levels, and it helps system administrators find failures and system malfunctions. This webinar gives an overview of the different LLview components and how they interact with each other and with the system. Particular attention is given to the system monitoring views and the job reporting features, as they allow the entire life cycle of a job to be traced and can help identify problems and bottlenecks at a very early stage.
About the speakers:
Vitor Silva received his Computer Science degree from Universidade Federal de Minas Gerais. He earned his M.Sc. in Systems and Computer Engineering from Universidade Federal do Rio de Janeiro and later received his Ph.D. in Nuclear Engineering from Universidade Federal de Minas Gerais. He worked as a software developer in the digital image processing field, but spent most of his career in Nuclear Engineering, mainly working on computer modeling and solving neutronics and thermal-hydraulics problems related to nuclear reactors. He was also the main administrator of a small cluster system installed from scratch. Since 2021 he has worked at the Jülich Supercomputing Centre on monitoring tools and simulation.
Filipe Guimarães is a computational physicist. He received his undergraduate degree, M.Sc., and Ph.D. in Physics from the Universidade Federal Fluminense. He has been working with High Performance Computing since 2014, initially as a user, and moved to the support side in 2020. Since then, one of his focuses has been improving the monitoring tools used and developed at the Jülich Supercomputing Centre.
About the moderator:
Esteban Mocskos is a full-time professor at Universidad de Buenos Aires (UBA) and a researcher at the Center for Computer Simulation (CSC-CONICET). He received his Ph.D. in Computer Science from UBA in 2008 and was a postdoc in the Protein Modelling group at UBA. His research interests include distributed systems and blockchain, computer networks, processor architecture, and parallel programming. He is part of the steering committee of the Latin-American HPC CARLA conference and one of the committee members of Argentina’s National HPC system.