seer-lab / arc
A tool to automatically repair concurrency bugs in Java.
Home Page: http://www.sqrlab.ca/software/arc/
License: Mozilla Public License 2.0
Currently we have to manually adjust ConTest's KingProperties file for every new program. Basically, we need to make sure ConTest will apply to the current system under test. To do this we modify the KingProperties file to specify the targetClasses and sourceDirs.
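For reference, the two entries we edit look something like the following (the property names come from the text above; the paths are illustrative placeholders, not real values):

```properties
# Hypothetical KingProperties entries -- paths are placeholders
targetClasses = ./workarea/classes
sourceDirs = ./workarea/src
```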
Due to issues with Pyevolve, we will create and use our own evolutionary algorithm in ARC.
Most of the algorithm components are already present; really only the main loop remains to be written.
Certain individuals might stray down a path that only yields poor success rates. One option is to prune/restart the lowest x scoring individuals (or the lowest x% of the population).
This might be too strict; a better variation might be to perform the pruning only if the individual has remained in the lower bound for n generations.
The non-functional fitness score measures the performance of the evolved software. That said, we are looking at measuring the time and resources used during the execution of the system under test.
We can acquire a large set of statistics using the Linux /usr/bin/time -v command. Unfortunately this breaks the cross-platform nature to some extent (though we have not tried ARC on Windows or Mac yet).
The output of this command looks like the following:
Command being timed: "python2 arc.py"
User time (seconds): 299.15
System time (seconds): 20.79
Percent of CPU this job got: 111%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:46.96
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 89368
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1433714
Voluntary context switches: 20243
Involuntary context switches: 16198
Swaps: 0
File system inputs: 8
File system outputs: 3072
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
From this we can extract the maximum memory used from the Maximum resident set size field. I am not sure memory is of concern though. We can also compute the real time the process spent on the CPU using User time + System time. We also have the wall time, though other processes could affect this, and I believe the real time would be a better measure.
We can then calculate the score like so (we can also take into account avg/min/max values if we want):
non-functional score = (max_memory * m) + (real_time * n) + (wall_time * p)
Again not too sure if we need wall-time or max memory.
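A minimal sketch of how the score above could be computed from the fields of the sample report; the function name and the default weights (only real time counted) are assumptions for illustration, not ARC's actual code:

```python
def non_functional_score(max_memory_kb, real_time_s, wall_time_s,
                         m=0.0, n=1.0, p=0.0):
    """Weighted non-functional score (lower is better).  The weights
    m, n, p are illustrative placeholders; with the defaults only
    real time (User + System) contributes."""
    return (max_memory_kb * m) + (real_time_s * n) + (wall_time_s * p)

# Using the sample /usr/bin/time -v report above:
#   real time = 299.15 + 20.79 s, max RSS = 89368 kB, wall = 4:46.96
score = non_functional_score(89368, 299.15 + 20.79, 4 * 60 + 46.96)
```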
The evolutionary algorithm works on optimizing the target software's functional or non-functional fitness score. The general process is to first optimize the functional fitness to ensure the correctness of the software and then to optimize the non-functional fitness score. The second phase attempts to optimize the software (in the most correct state found from the previous phase) for best performance without sacrificing any functionality.
Work proceeds in the functional phase until functional termination occurs (no change in the score for a while, or a minimum fitness threshold is met). When termination occurs, a second, more focused test is applied to ensure the individual indeed works correctly. The phase then switches to non-functional; at this switch the population is transformed into the best functional individual found. Work then proceeds in the non-functional phase until non-functional termination occurs (no change in the score for a while). Note that any individual that reduces the functional score is discarded.
When working off of the correct software found from the first phase we can assume that no errors are present. This assumption will allow us to reduce the timeout parameter to an average of x runs of the correct software (plus some cushion time).
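The reduced timeout could be computed like this (the helper name and the 25% cushion are illustrative assumptions):

```python
def reduced_timeout(runtimes, cushion=0.25):
    """Timeout for the non-functional phase: the average of x runs of
    the known-correct variant, plus a cushion (25% is an illustrative
    value, not ARC's)."""
    average = sum(runtimes) / len(runtimes)
    return average * (1.0 + cushion)

reduced_timeout([4.0, 5.0, 6.0])  # -> 6.25
```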
The functional fitness is currently scored on the success rate alone. The problem with this is that it is hard to distinguish two individuals that have the same success rate.
The proposed solution is to consider the timeouts as well. A timeout could have resulted in a successful execution if the time limit had been extended. Thus timeouts should affect the fitness (worse than a success, but better than a deadlock/data race).
Therefore the functional fitness score could be:
score = (# of successes * n) + (# of timeouts * m)
n in this case represents the factor of weighting for successful executions, while m is the weighting for timeouts.
It might also be worth considering the density of dataraces in an individual by using the number of errors in the testsuite.
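A sketch of the scoring formula above; the weights n and m are illustrative placeholders chosen so that n > m > 0, keeping the ordering success > timeout > deadlock/data race (failures contribute nothing):

```python
def functional_score(successes, timeouts, n=2.0, m=1.0):
    """Weighted functional fitness.  The weights are illustrative,
    not ARC's actual values."""
    return (successes * n) + (timeouts * m)

# Two individuals with the same success count are now distinguishable:
functional_score(8, 2)  # -> 18.0
functional_score(8, 0)  # -> 16.0
```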
After every mutation, an individual gets compiled before the ConTest evaluation phase can occur. In a situation where the mutation causes the individual to fail to compile, there should be some handling to choose another mutation until either mutations are exhausted (thereby restarting the individual) or a valid mutation compiles.
It might be more appropriate to move the compiling phase directly into the mutation method (to keep them tightly coupled).
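The retry handling could look like this sketch; `operators` and `compiles` are stand-ins for ARC's mutation operators and its javac step, not real ARC names:

```python
import random

def mutate_until_compiles(individual, operators, compiles):
    """Try operators until one produces a compiling mutant; returning
    None signals the caller to restart the individual."""
    remaining = list(operators)
    while remaining:
        operator = random.choice(remaining)
        candidate = operator(individual)
        if compiles(candidate):
            return candidate        # valid, compiling mutation
        remaining.remove(operator)  # exhausted; try a different operator
    return None                     # all mutations failed to compile
```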
Adding the TXL files (which perform the mutations) to the arc repository.
Currently, if two different members of the population arrive at the same program structure after mutation, each is evaluated. Time would be saved if after we do the first evaluation, we record a hash of the mutated program and the result of the evaluation. For every new mutation, we can check against the hash list. If we find it, there is no need to test it again. Reject the mutation and try again.
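The hash-and-reject idea could be sketched like this (the function name and the use of SHA-256 are assumptions; any stable hash of the mutant's source would do):

```python
import hashlib

evaluated = {}  # SHA-256 of mutant source -> recorded evaluation result

def cached_evaluation(source_text, evaluate):
    """Return (result, was_cached).  `evaluate` stands in for the
    ConTest evaluation run; a cache hit means the mutant can be
    rejected without re-testing."""
    key = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    if key in evaluated:
        return evaluated[key], True
    result = evaluate(source_text)
    evaluated[key] = result
    return result, False
```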
Somehow we need to make TXL replace variables (randomly) based on the contents of a file.
The existing source code must be converted to Python 2.7 to work with the Pyevolve framework. Currently the source code is written in Python 3.2.
Additionally, some of the formatting must be adjusted.
The README must also reflect this change.
Create the individual representation that will be used in Pyevolve's evolutionary approach.
A 2D list of binary values (like the table presented) to act as the genome for the individual. Additionally, there will be values that each individual will track:
Mutation Operator 1: | 0 | 0 | 0 |
Mutation Operator 2: | 0 |
Mutation Operator 3: | 0 | 0 |
Mutation is a single change in the 2D list of binary values.
There is no crossover.
Finish the ConTest module that will perform the following:
Some of the mutation operators can be improved:
We need some mutation operators to address the new Java 5 concurrency structures:
We need a mutation operator that changes the lock object to that of another shared object
Currently the evolution process will only terminate if an individual meets 100% success rate or if the generation limit is reached.
It might be more applicable to include alternative terminating criteria that can also be used such as:
The mutation operators seem very general in terms of their applicability (for example, out of all the operators, total - 1 are used for data race situations and total are used for deadlock situations). The heuristic search capabilities are not going to be visible or effective in this situation.
The following describes an adaptation to the mutation operator selection. Based on prior execution results, if the selected operator ends up increasing the rate of successful executions, then it can be said that the mutation operator was effective. Thus, in future generations, individuals might benefit from more applications of said mutation operator. Therefore, all the mutation operators will have a dynamic rank based on their historical success rate for this software.
The dynamic ranking can be determined using the following formula:
rank_score += current_successes - previous_successes
The rank scores are then sorted and the highest scores are ranked first. From this ranking we then want to weight the percentages of a mutation operator being selected for application. To avoid converging on only a few operators, a floor and ceiling will be applied to the percentage modifiers. Most likely the format will be something like: best rated +10%, worst -10%, second best +8%, second worst -8%, etc. This approach still keeps all the operators in play regardless of how they are ranked.
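One simple way to realize the weighting step is a linear spread between the ceiling and floor (rather than fixed 2% steps); the function name, the ±10% cap, and the operator names in the example are illustrative assumptions:

```python
def operator_weights(rank_scores, cap=0.10):
    """Map operators, sorted by rank score, to selection-weight
    multipliers between 1 + cap (best) and 1 - cap (worst).  The
    floor/ceiling keeps every operator selectable."""
    ordered = sorted(rank_scores, key=rank_scores.get, reverse=True)
    k = len(ordered)
    weights = {}
    for i, op in enumerate(ordered):
        modifier = cap if k == 1 else cap - i * (2.0 * cap / (k - 1))
        weights[op] = 1.0 + modifier
    return weights

# e.g. three operators with rank scores 5, 3, 1:
weights = operator_weights({"ASAV": 5, "EXCR": 3, "SHSB": 1})
```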
ARC currently mutates all individuals, then evaluates all individuals, and finally validates the potential solutions for a fix. The downside of the current implementation is that if a potential fix is found in the first individual, ARC does not validate it until after all individuals have been evaluated.
The proposed optimization is to evaluate and then validate each individual as a whole. This way, if a fix is found in an early individual, ARC does not have to wait until all individuals are evaluated.
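The restructured generation loop could look like this sketch; all four callable names are illustrative stand-ins for ARC's phases:

```python
def run_generation(population, mutate, evaluate, validate):
    """Evaluate then immediately validate each individual, so a fix
    found early short-circuits the rest of the generation."""
    for individual in population:
        mutant = mutate(individual)
        if evaluate(mutant) and validate(mutant):
            return mutant   # potential fix confirmed: stop here
    return None             # no fix this generation
```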
Create a 10 page paper about ARC.
Part (all?) of the problem we are having with projects not coming from the right source is in the mutate function. For example, 'create representation' was being called before 'create local project' and 'mutate project'.
The purpose of this branch is to attempt to fix these issues. It has already opened up additional problems that I have created hotfixes for.
TODO:
The current approach for replacement is to first wait n generations and then perform the replacement every generation thereafter. This approach is a bit harsh and illogical. The proposed fix is to only perform the replacement every n generations instead, effectively on a defined interval.
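The interval check is simple; the helper name is illustrative:

```python
def should_replace(generation, interval):
    """Replace the lowest-scoring individuals only on a fixed
    interval, not every generation after a warm-up period."""
    return generation > 0 and generation % interval == 0

[g for g in range(1, 13) if should_replace(g, 4)]  # -> [4, 8, 12]
```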
The wiki must be updated with the new genome representation, as well as the new approach for ARC.
The new genome representation is for a specific state of the individual. The only concerns are which mutations can be applied to the individual (there is no crossover anymore, as the search consists of heuristically guided mutations). Therefore, each bit represents a possible location where the associated mutation operator can be applied in the source code. This works because TXL will encounter the mutation's pattern in the source code in the same order each time, so it becomes possible to flip a bit and apply said mutation.
For example, this genome can be acquired after checking how many pattern instances of mutation operators 1, 2 and 3 were detected.
Mutation Operator 1: | 0 | 0 | 0 |
Mutation Operator 2: | 0 |
Mutation Operator 3: | 0 | 0 |
Applying a random mutation might flip the rightmost bit of mutation operator 3. The result is that the third mutation operator will be applied at the second instance of the TXL match for that mutation pattern.
The new approach does not use crossover as the problem seems ill-fitted for it. After some brainstorming there was no clear approach to incorporate crossover. Therefore, crossover was removed entirely which now changes the search algorithm into more of an evolutionary approach. The solution will now be found by simply applying mutation in a heuristically guided manner using prior test execution feedback.
Also update the README so it remains current.
We discovered a bug: if an individual runs out of mutations, its mutation list [ASAV, ..., SHSB] stops growing. The adjust-operator-weights function scans past the end of the list at higher generations, causing an out-of-bounds exception. We are considering how to fix this.
When we switch from functional to non-functional, the score changes radically (it decreases, as the non-functional phase is just starting). The sliding window would then consider both the functional AND non-functional scores when ranking non-functional operators. We don't want this.
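One way to avoid the mixing is to clear the window at the phase switch; the class name and its shape are illustrative, not ARC's actual sliding-window code:

```python
class SlidingWindow:
    """Fixed-size window of recent scores used when ranking operators.
    Resetting it at the functional -> non-functional switch keeps
    mixed-phase scores out of the non-functional ranking."""

    def __init__(self, size):
        self.size = size
        self.scores = []

    def add(self, score):
        self.scores.append(score)
        self.scores = self.scores[-self.size:]  # keep only the newest

    def reset(self):
        self.scores = []  # call this at the phase switch

window = SlidingWindow(size=5)
for s in [10, 20, 30]:
    window.add(s)
window.reset()  # non-functional phase starts with a clean window
```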