Distributed Systems
Work description
Responsibilities under the grant:
- Design techniques and mechanisms for managing the performance and energy consumption of GPUs used for deep learning in distributed environments.
- Integrate and evaluate the proposed techniques in large-scale, high-performance computing environments (i.e., supercomputers).
- Conduct experimental evaluations of the developed techniques using a variety of deep learning models and hardware devices (e.g., various processing and storage devices).
- Produce technical reports and scientific articles.
Academic Qualifications
- Enrolled in a doctoral program in informatics or informatics engineering.
Minimum profile required
- Solid knowledge of and experience with the design of machine learning, deep learning, and large language models (e.g., ResNet18, ResNet50, AlexNet, VGG19, Llama, Qwen, GPT).
- Solid knowledge of the training pipeline and its performance bottlenecks.
- Knowledge of and experience with high-performance computing environments, including scripting, experimental evaluations, and the collection and analysis of performance, resource usage, and energy consumption metrics.
Preference factors
- Experience with deep learning frameworks, including PyTorch, TensorFlow, and DeepSpeed.
- Knowledge of performance and energy consumption optimizations designed for DL training.
- Knowledge of operating systems and distributed systems.
- Experience with the Python and C++ programming languages.
Application Period
From 26 Jun 2025 to 09 Jul 2025
Centre
High-Assurance Software