Distributed Systems
Work description
Responsibilities under the grant:
- Design techniques and mechanisms for managing the performance and energy consumption of GPUs used for deep learning in distributed environments.
- Integrate and evaluate the proposed techniques in large-scale, high-performance computing environments (i.e., supercomputers).
- Conduct experimental evaluations of the developed techniques using a variety of deep learning models and hardware devices (e.g., various processing and storage devices).
- Produce technical reports and scientific articles.
Academic Qualifications
- Enrolled in a doctoral program in informatics or informatics engineering.
Minimum profile required
- Solid knowledge of and experience with the design of machine learning, deep learning, and large language models (e.g., ResNet18, ResNet50, AlexNet, VGG19, Llama, Qwen, GPT).
- Solid knowledge of the training pipeline and its performance bottlenecks.
- Knowledge of and experience with high-performance computing environments, including scripting, experimental evaluations, and the collection and analysis of performance, resource usage, and energy consumption metrics.
Preference factors
- Experience with deep learning frameworks, including PyTorch, TensorFlow, and DeepSpeed.
- Knowledge of performance and energy consumption optimizations designed for DL training.
- Knowledge of operating systems and distributed systems.
- Experience with the Python and C++ programming languages.
Application Period
From 26 Jun 2025 to 09 Jul 2025
Centre
High-Assurance Software