  • Name

    João Carlos Barbosa
  • Role

    Assistant Researcher
  • Since

    15th March 2021


Hybrid Image-/Data-Parallel Rendering Using Island Parallelism

Zellmann, S; Wald, I; Barbosa, J; Demirci, S; Sahistan, A; Gudukbay, U;


In parallel ray tracing, techniques fall into one of two camps: image-parallel techniques aim at increasing frame rate by replicating scene data across nodes and splitting the rendering work across different ranks, and data-parallel techniques aim at increasing the size of the model that can be rendered by splitting the model across multiple ranks, but typically cannot scale much in frame rate. We propose and evaluate a hybrid approach that combines the advantages of both by splitting a set of N x M ranks into M islands of N ranks each and using data-parallel rendering within each island and image parallelism across islands. We discuss the integration of this concept into four wildly different parallel renderers and evaluate the efficacy of this approach based on multiple different data sets.
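
The island split described in the abstract can be illustrated with a minimal sketch. It assumes ranks are grouped contiguously into islands and that image tiles are handed to islands round-robin; both choices, and the function names `island_layout` and `tiles_for_island`, are hypothetical and not taken from the paper.

```python
def island_layout(rank: int, n: int, m: int) -> tuple[int, int]:
    """Map a global rank in [0, n*m) to (island index, rank within island).

    Assumption: ranks are grouped contiguously, so island k holds
    ranks k*n .. k*n + n - 1.
    """
    if not 0 <= rank < n * m:
        raise ValueError("rank out of range")
    return rank // n, rank % n


def tiles_for_island(island: int, m: int, num_tiles: int) -> list[int]:
    """Image parallelism across islands: round-robin tile assignment (assumed)."""
    return list(range(island, num_tiles, m))


if __name__ == "__main__":
    n, m = 4, 3  # 3 islands of 4 ranks each, 12 ranks in total
    for rank in range(n * m):
        island, local = island_layout(rank, n, m)
        # Within an island, local rank i would hold the i-th of n model parts
        # (data parallelism); every island holds a full copy of the model.
        print(rank, island, local, tiles_for_island(island, m, num_tiles=8))
```

In this layout each island holds one complete copy of the model spread over its N ranks, while the M islands work on disjoint sets of tiles.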


LOOM: Interweaving tightly coupled visualization and numeric simulation framework

Barbosa, J; Navratil, P; Paulo Santos, L; Fussell, D;

ACM International Conference Proceeding Series

Traditional post-hoc high-fidelity scientific visualization (HSV) of numerical simulations requires multiple I/O check-pointing to inspect the simulation progress. The costs of these I/O operations are high and can grow exponentially with increasing problem sizes. In situ HSV dispenses with costly check-pointing I/O operations, but requires additional computing resources to generate the visualization, increasing power and energy consumption. In this paper we present LOOM, a new interweaving approach supported by a task scheduling framework to allow tightly coupled in situ visualization without significantly adding to the overall simulation runtime. The approach exploits the idle times of the numerical simulation threads, due to workload imbalances, to perform the visualization steps. Overall execution time (simulation plus visualization) is minimized. Power requirements are also minimized by sharing the same computational resources among numerical simulation and visualization tasks. We demonstrate that LOOM reduces time to visualization by 3× compared to a traditional non-interwoven pipeline. Our results here demonstrate good potential for additional gains for large distributed-memory use cases with larger interleaving opportunities. © 2021 ACM.


A framework for efficient execution of data parallel irregular applications on heterogeneous systems

Ribeiro R.; Barbosa J.; Santos L.P.;

Parallel Processing Letters

Exploiting the computing power of the diversity of resources available on heterogeneous systems is mandatory but a very challenging task. The diversity of architectures, execution models and programming tools, together with disjoint address spaces and different computing capabilities, raises a number of challenges that severely impact application performance and programming productivity. This problem is further compounded in the presence of data parallel irregular applications. This paper presents a framework that addresses development and execution of data parallel irregular applications in heterogeneous systems. A unified task-based programming and execution model is proposed, together with inter- and intra-device scheduling, which, coupled with a data management system, aim to achieve performance scalability across multiple devices, while maintaining high programming productivity. Intra-device scheduling on wide SIMD/SIMT architectures resorts to consumer-producer kernels, which, by allowing dynamic generation and rescheduling of new work units, enable balancing irregular workloads and increase resource utilization. Results show that regular and irregular applications scale well with the number of devices, while requiring minimal programming effort. Consumer-producer kernels are able to sustain significant performance gains as long as the workload per basic work unit is enough to compensate for the overheads associated with intra-device scheduling. This not being the case, consumer kernels can still be used for the irregular application. Comparisons with an alternative framework, StarPU, which targets regular workloads, consistently demonstrate significant speedups. This is, to the best of our knowledge, the first published integrated approach that successfully handles irregular workloads over heterogeneous systems.
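
The consumer-producer idea mentioned in the abstract (a work unit may generate new work units that are rescheduled) can be sketched as a simple work queue. This is a sequential CPU analogy only, not the SIMD/SIMT kernels of the paper; `run_consumer_producer` and the example task `split_range` are hypothetical names introduced for illustration.

```python
from collections import deque


def run_consumer_producer(initial_units, process):
    """Drain a work queue in which processing one unit may produce new units."""
    queue = deque(initial_units)
    results = []
    while queue:
        unit = queue.popleft()
        output, new_units = process(unit)
        if output is not None:
            results.append(output)
        # Dynamically generated work is rescheduled on the same queue.
        queue.extend(new_units)
    return results


def split_range(unit, max_size=4):
    """Hypothetical irregular task: split a range until it is small enough,
    then sum the integers it contains."""
    lo, hi = unit
    if hi - lo <= max_size:
        return sum(range(lo, hi)), []
    mid = (lo + hi) // 2
    return None, [(lo, mid), (mid, hi)]


if __name__ == "__main__":
    partial_sums = run_consumer_producer([(0, 10)], split_range)
    print(partial_sums, sum(partial_sums))
```

A unit that only consumes (returns no new units) corresponds loosely to the plain consumer case; the amount of work per unit relative to the queueing overhead is what the abstract identifies as the deciding factor.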