Zdeněk Letko - Diplomová práce

Webová prezentace pro soutěž ČSKI.

Introduction

Finding concurrency bugs in complex software is extremely difficult because concurrency bugs do not manifest themselves predictably. As a contribution to coping with this problem the thesis proposes an architecture for a fully automated dynamic detection and healing of two kinds of concurrency bugs - data races and atomicity violations. Two distinct algorithms for detecting of data races are presented. One of them is a novel algorithm called AtomRace which detects data races as a special case of atomicity violations. The thesis also proposes several techniques for automatic healing (correcting) of detected bugs. The healing techniques applying at run-time are based on suppressing a recurrence of the detected problem. Basically forces certain parts of the code to be executed in a correct way. The architecture and algorithms were implemented and tested on multiple case studies include two complex real life software products provided by IBM and Telefonica.

Studied Concurrency Bugs

A data race occurs when two concurrent threads access a shared variable and when: at least one access is a write and the threads use no explicit mechanism to prevent the accesses from being simultaneous. The value of such shared variable is then unpredictable. Data races often appear hand-in-hand with atomicity violations. An atomicity violation occurs if a block of code that operates with some shared variable and should be executed atomically is interleaved by another thread that access the shared variable in an unserializable way. A simple example is given in Figure [1].

An example of atomicity violation.

Figure 1: An example of atomicity violation.

The figure depicts an example of a line of code simultaneously executed by two different threads. The code changes a counter, e. g., a counter of already sold airplane tickets. The simple statement sold_counter++; is in Java byte-code expressed by three shown instructions. Due to atomicity violation and unfortunate interleaving of byte-code instructions, the value of the counter is wrong (counter shows that only one ticket was sold instead of two). It is also evident that the problem manifests only when an unfortunate interleaving happens. Because the bug manifests itself only rarely, it is very difficult to find such a bug. The technology proposed in this thesis is able to detect occurrence of such problems and correct them at run-time.

Overview of Proposed Technology

The proposed technology for dynamic detection and healing of introduced bugs is depicted in Figure [2]. The process of fully automatic bug detection and healing can be divided into two parts. Firstly, a preprocessing is done before the analyzed program is executed. Then, a dynamic analysis is used to detect and heal bugs during the execution.

Proposed technology overview.

Figure 2: Proposed technology overview.

The detection and healing process consists of these seven tasks:

  1. Bytecode Instrumentation – IBM concurrency testing tool ConTest is used to instrument the given bytecode. Instrumentation injects into original bytecode additional instructions which are then used for tracking the execution.

  2. Obtaining Atomicity – Bytecode static analyzer FindBugs is then used to detect those blocks of code that should be executed without unwanted interleaving. There are two different approaches which can be used to obtain these blocks of code:

    • Pattern Based Static Analysis – Detects typical programming constructions.

    • Access Interleaving Invariants Based Static Analysis – Detects two immediately consequent accesses to a shared variable – candidates for blocks of code that should not be interleaved. Then, a dynamic analysis must be used to prune this set of candidates to contain only those which holds for the program. This way, it is possible to learn the correct atomicity of the program.

  3. Program Execution – Instrumented program is executed in the same way as the original program.

  4. Execution Monitoring – Instructions injected by the instrumentation allows ConTest to track the execution and propagate obtained information to the detector and healer in the form of a stream of events describing the behavior of the executed program. Moreover, injected instructions also allows ConTest and race detector and healer to take control of the execution.

  5. Problem Detection – The stream is analyzed by a detection algorithm. Two algorithms for detecting data races were implemented:

    • Eraser+ Algorithm – This algorithm is a modified Eraser algorithm which tries to infer information which lock is used to protect a shared variable. When a shared variable is not protected by a lock, data race could occur. The original algorithm was improved by a support of join synchronization and modified to reflect Java environment. These modifications rapidly decreased the number of false alarms (spurious warning about data races). But, the algorithm can still produce a false alarm in a case when different synchronization than locks is used. Suppressing of false alarms is very important because otherwise the technology would heal unreal bugs.

    • AtomRace Algorithm – A novel algorithm detects data races as a special cases of atomicity violations. A primitive atomic section is defined around each access to a shared variable (the section starts immediately before access instruction and ends immediately after the access instruction). When such primitive atomic sections defined over the same variable overlap, a data race is detected. This approach does not suffer from false alarms and, moreover, supports detection of violations of more general atomic sections obtained during the preprocessing.

  6. Problem Analysis – When a problem is detected, a report is recorded into a log file. This report contains enough information so a software developer is able to identify and understand the cause of the detected problem.

  7. Healing – Data races and atomicity violations are usually caused by a missing or wrongly used synchronization. Two approaches for coping with this problem at run-time were implemented:

    • Additional Synchronization – A new lock is attached to the problematic shared variable. The lock is acquired each time a thread is about to access the variable and released after access. The lock is not released in between of accesses that should be executed atomically. This approach heals the problem totally but can cause another concurrency problem - deadlock.

    • Legal Influencing of the Scheduler – The healer injects into execution a Java instruction(s) which cause either a context switch, a delay, or a change of a thread priority. The idea behind is to change threads interleaving scenario so that the synchronization is not necessary. This approach can only decrease the probability of a problem manifestation. Experiments showed that in some cases this approach helps.

Experimental Results

The proposed technology for a dynamic data race detection and healing was tested on three different test cases which differ in size and conditions under which data races manifest. The first test case was simple well known Sun's bank example which was chosen to examine the basic behavior of the proposed healing techniques. Two others test cases were real software products from IBM and Telefonica.


Implemented prototype was able to detect all known data races and, moreover, also some previously unknown data races and atomicity violations. These new bugs were reported to developers and corrected by them. The success in detection phase was in a novel combination of the dynamic data race detection algorithm and noise injection techniques provided by the ConTest tool. Noise injection allows to see more possible and still legal threads interleaving during multiple execution of a test and so gives the algorithm opportunity to analyze more interleaving scenarios. A new heuristics for the noise injection were also developed apart of the work on the thesis and successfully tested. The new heuristics improved the successfulness of detection algorithms.

Healing efficiency.

Figure 3: Healing efficiency.

The results of some proposed healing techniques can be seen from a graph in Figure [3]. The graph shows efficiency of healing of a data race in the Sun's Bank example executed on a four processors machine. The red line represents the percentage of the data race manifestation when the program is executed without the race detector and healer. The green line represents situation when detection is enabled and healing is disabled. The probability of the race manifestation is higher because a time window needed for problematic interleaving is longer due to injected code of ConTest and race detector and healer. Then, when healing was enabled, the probability of the race manifestation decreases as can be seen in the case of the blue line. In the case, when only two threads are in conflict which can lead to a bug manifestation, the healing technique suppress the bug totally. The violet line shows that healing based on introducing an additional synchronization, not surprisingly, suppresses the race manifestation totally in all cases.

Conclusions

There are three main contributions of the thesis to the current state of coping with concurrent bugs. Firstly, a novel algorithm AtomRace for dynamic detection of data races and atomicity violations was proposed and compared with a modified well known Eraser algorithm. AtomRace does not produce false alarms, supports all kinds of possible synchronization, and is also able to detect violations of more general atomicity. Secondly, a combination of algorithms for dynamic detection with a noise injection technique was studied. It showed up that with suitable heuristics for a noise injection this combination leads to significant improvements in the detection efficiency. And finally, healing techniques for healing of data races at run-time were introduced. Some of these techniques are able to make a buggy code to behave correctly during the execution.