Informatics
[Open soon]
Work description
Responsibilities under the grant:
- Design new mechanisms to monitor the performance and energy consumption of model training workloads in advanced computing infrastructures.
- Design techniques and mechanisms to improve GPU performance and energy efficiency, with minimal impact on key training metrics, such as execution time and accuracy.
- Integrate and evaluate the proposed techniques in large-scale, high-performance computing environments (i.e., supercomputers).
- Conduct experimental evaluations of the developed techniques, using a variety of deep learning models and hardware devices (e.g., various processing and storage devices).
- Write technical reports and scientific papers.
Academic Qualifications
Enrolled in a doctoral program in Informatics or Informatics Engineering.
Minimum profile required
- Experience in designing energy monitoring tools, with particular focus on multi-threaded and distributed scenarios.
- Experience with observability tools, particularly OpenTelemetry.
- Solid knowledge of and experience in machine learning, deep learning, and large language models (e.g., ResNet18, ResNet50, AlexNet, VGG19, Llama, Qwen, GPT).
- Solid knowledge of the training pipeline and its performance bottlenecks.
- Knowledge of and experience with high-performance computing environments, including scripting, experimental evaluations, and the collection and analysis of performance, resource usage, and energy consumption metrics.
Preference factors
- Experience with deep learning frameworks, including PyTorch, TensorFlow, and DeepSpeed.
- Knowledge of performance and energy consumption optimizations designed for DL training.
- Knowledge of operating systems and distributed systems.
- Experience with the Python, C++, and Go programming languages.
Application Period
From 20 Nov 2025 to 04 Dec 2025
Centre
High-Assurance Software