About

Nuno Machado is a post-doctoral researcher at the High-Assurance Software Laboratory (HASLab) of University of Minho and INESC TEC.

His current research focuses on designing scalable and resilient distributed systems for storing and analyzing massive amounts of data. He is also interested in privacy-aware solutions for cloud computing and IoT.

Nuno received a Ph.D. in Computer Science and Engineering from Instituto Superior Técnico (University of Lisbon), under the supervision of Luís Rodrigues. He worked on automated debugging techniques for multithreaded applications that allow developers to deterministically replay concurrency bugs and isolate their root causes.

In the summer of 2014, Nuno was an intern at Microsoft Research (Redmond), working with Brandon Lucia on concurrency debugging.

Details

  • Name: Nuno Almeida Machado
  • Role: External Research Collaborator
  • Since: 13th July 2016
Publications

2021

Horus: Non-Intrusive Causal Analysis of Distributed Systems Logs

Authors
Neves, F; Machado, N; Vilaca, R; Pereira, J;

Publication
51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2021)

Abstract
Logs are still the primary resource for debugging distributed systems executions. Complexity and heterogeneity of modern distributed systems, however, make log analysis extremely challenging. First, because of the sheer number of messages, the execution paths of distinct system components appear interleaved. Second, due to unsynchronized physical clocks, simply ordering the log messages by timestamp does not suffice to obtain a causal trace of the execution. To address these issues, we present Horus, a system that enables the refinement of distributed system logs in a causally-consistent and scalable fashion. Horus leverages kernel-level probing to capture events for tracking causality between application-level logs from multiple sources. The events are then encoded as a directed acyclic graph and stored in a graph database, thus allowing the use of rich query languages to reason about runtime behavior. Our case study with TrainTicket, a ticket booking application with 40+ microservices, shows that Horus surpasses current widely-adopted log analysis systems in pinpointing the root cause of anomalies in distributed executions. Also, we show that Horus builds a causally-consistent log of a distributed execution with much higher performance (up to 3 orders of magnitude) and scalability than prior state-of-the-art solutions. Finally, we show that Horus' approach to query causality is up to 30 times faster than graph database built-in traversal algorithms.
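
As a rough illustration of why timestamp ordering can break causality when clocks are unsynchronized, the sketch below orders four made-up log events in two ways: by local timestamp, and by a topological sort of a happens-before graph. The event names, timestamps, and edges are hypothetical, and this is not how Horus is implemented (the abstract describes kernel-level probing and a graph database); it only shows the general idea of a causally consistent order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical events: id -> (node, local clock reading, message).
# Node B's clock lags behind node A's, so B's readings look "earlier".
events = {
    "a1": ("A", 105, "send request"),
    "b1": ("B", 100, "receive request"),
    "b2": ("B", 101, "send reply"),
    "a2": ("A", 110, "receive reply"),
}

# Assumed happens-before relation: event id -> set of direct predecessors.
# Edges come from program order on each node and from send/receive pairs.
predecessors = {
    "a1": set(),
    "b1": {"a1"},        # the request is received after it is sent
    "b2": {"b1"},        # program order on node B
    "a2": {"a1", "b2"},  # program order on node A, plus reply delivery
}

# Naive order: sort by local timestamp only.
by_timestamp = sorted(events, key=lambda e: events[e][1])

# Causal order: any topological order of the happens-before DAG.
causal = list(TopologicalSorter(predecessors).static_order())

print(by_timestamp)
print(causal)
```

In this toy trace the timestamp order places the receive on node B before the send on node A, while the topological order respects every happens-before edge.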

2019

Minha: Large-Scale Distributed Systems Testing Made Practical

Authors
Machado, N; Maia, F; Neves, F; Coelho, F; Pereira, J;

Publication
23rd International Conference on Principles of Distributed Systems, OPODIS 2019, December 17-19, 2019, Neuchâtel, Switzerland.

Abstract
Testing large-scale distributed system software is still far from practical as the sheer scale needed and the inherent non-determinism make it very expensive to deploy and use realistically large environments, even with cloud computing and state-of-the-art automation. Moreover, observing global states without disturbing the system under test is itself difficult. This is particularly troubling as the gap between distributed algorithms and their implementations can easily introduce subtle bugs that are disclosed only with suitably large scale tests. We address this challenge with Minha, a framework that virtualizes multiple JVM instances in a single JVM, thus simulating a distributed environment where each host runs on a separate machine, accessing dedicated network and CPU resources. The key contributions are the ability to run off-the-shelf concurrent and distributed JVM bytecode programs while at the same time scaling up to thousands of virtual nodes; and enabling global observation within standard software testing frameworks. Our experiments with two distributed systems show the usefulness of Minha in disclosing errors, evaluating global properties, and in scaling tests orders of magnitude with the same hardware resources.

2019

Concurrency Debugging with MaxSMT

Authors
Terra Neves, M; Machado, N; Lynce, I; Manquinho, V;

Publication
Thirty-Third AAAI Conference on Artificial Intelligence / Thirty-First Innovative Applications of Artificial Intelligence Conference / Ninth AAAI Symposium on Educational Advances in Artificial Intelligence

Abstract
Current Maximum Satisfiability (MaxSAT) algorithms based on successive calls to a powerful Satisfiability (SAT) solver are now able to solve real-world instances in many application domains. Moreover, replacing the SAT solver with a Satisfiability Modulo Theories (SMT) solver enables effective MaxSMT algorithms. However, MaxSMT has seldom been used in debugging multi-threaded software. Multi-threaded programs are usually non-deterministic due to the huge number of possible thread operation schedules, which makes them much harder to debug than sequential programs. A recent approach to isolate the root cause of concurrency bugs in multi-threaded software is to produce a report that shows the differences between a failing and a non-failing execution. However, since they rely solely on heuristics, these reports can be unnecessarily large. Hence, reports may contain operations that are not relevant to the bug's occurrence. This paper proposes the use of MaxSMT for the generation of minimal reports for multi-threaded software with concurrency bugs. The proposed techniques report situations that the existing techniques are not able to identify. Experimental results show that using MaxSMT can significantly improve the accuracy of the generated reports and, consequently, their usefulness in debugging the root cause of concurrency bugs.

2018

CoopREP: Cooperative record and replay of concurrency bugs

Authors
Machado, N; Romano, P; Rodrigues, L;

Publication
Software Testing, Verification & Reliability

Abstract
This paper presents CoopREP, a system that provides support for fault replication of concurrent programs based on cooperative recording and partial log combination. CoopREP uses partial logging to reduce the amount of information that a given program instance is required to store to support deterministic replay. This allows reducing substantially the overhead imposed by the instrumentation of the code, but raises the problem of finding a combination of logs capable of replaying the fault. CoopREP tackles this issue by introducing several innovative statistical analysis techniques aimed at guiding the search of the partial logs to be combined and needed for the replay phase. CoopREP has been evaluated using both standard benchmarks for multithreaded applications and real-world applications. The results highlight that CoopREP can successfully replay concurrency bugs involving tens of thousands of memory accesses, while reducing recording overhead with respect to state-of-the-art noncooperative logging schemes by up to 13x (and by 2.4x on average).

2018

Falcon: A Practical Log-based Analysis Tool for Distributed Systems

Authors
Neves, F; Machado, N; Pereira, J;

Publication
2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Abstract
Programmers and support engineers typically rely on log data to narrow down the root cause of unexpected behaviors in dependable distributed systems. Unfortunately, the inherently distributed nature and complexity of such distributed executions often lead to multiple independent logs, scattered across different physical machines, with thousands or millions of entries poorly correlated in terms of event causality. This renders log-based debugging a tedious, time-consuming, and potentially inconclusive task. We present Falcon, a tool aimed at making log-based analysis of distributed systems practical and effective. Falcon's modular architecture, designed as an extensible pipeline, allows it to seamlessly combine several distinct logging sources and generate a coherent space-time diagram of distributed executions. To preserve event causality, even in the presence of logs collected from independent unsynchronized machines, Falcon introduces a novel happens-before symbolic formulation and relies on an off-the-shelf constraint solver to obtain a coherent event schedule. Our case study with the popular distributed coordination service Apache Zookeeper shows that Falcon eases the log-based analysis of complex distributed protocols and is helpful in bridging the gap between protocol design and implementation.