msv-lab / symlog Goto Github PK

Symbolic executor of Datalog

License: Universal Permissive License v1.0

Python 99.23% Dockerfile 0.77%

symlog's Introduction

Symlog

Symlog is a symbolic executor of Datalog programs. It computes dependencies between the input and the output of a query by executing the query on a symbolic database. Please check more details in our publication:

Program Repair Guided by Datalog-Defined Static Analysis
Liu Yu, Sergey Mechtaev, Pavle Subotic, Abhik Roychoudhury
FSE 2023

Installation

Setting up Souffle. To install Souffle, execute the following commands in the terminal:

wget -P /tmp https://github.com/souffle-lang/souffle/releases/download/2.2/x86_64-ubuntu-2104-souffle-2.2-Linux.deb
sudo dpkg -i /tmp/x86_64-ubuntu-2104-souffle-2.2-Linux.deb
sudo apt-get install -f
rm /tmp/x86_64-ubuntu-2104-souffle-2.2-Linux.deb

Setting up Python Environment. First, ensure you have virtualenv installed. If not, you can install it using pip:

pip install virtualenv

Create and activate a virtual environment with Python 3.10:

virtualenv myenv --python=python3.10
source myenv/bin/activate

Install the required dependencies within the environment:

pip install -r requirements.txt
pip install -e .

Note: All Python dependencies are localized to the myenv environment. If you work in a different one, you will need to reinstall the dependencies.

To deactivate the virtual environment:

deactivate

Docker Image

To build the image:

sudo docker build -t symlog .

To run the container:

sudo docker run -it symlog

Usage Example

Creating Datalog Rules and Facts in Python

Use the following Python script to define Datalog rules and facts:

from symlog.shortcuts import (
    SymbolicConstant,
    String,
    Literal,
    Variable,
    Rule,
    Fact,
    SymbolicSign,
    symex,
)
from symlog.souffle import SYM

# the reachability rules
edge_relation1 = Literal(
    "edge", [Variable("x"), Variable("y")]
)  # default sign is positive
reachable_relation1 = Literal("reachable", [Variable("x"), Variable("y")])

edge_relation2 = Literal("edge", [Variable("x"), Variable("y")])
reachable_relation2 = Literal("reachable", [Variable("x"), Variable("z")])
reachable_relation3 = Literal("reachable", [Variable("y"), Variable("z")])

rules = [
    Rule(reachable_relation1, [edge_relation1]),  # reachable(x, y) :- edge(x, y).
    Rule(
        reachable_relation2, [edge_relation2, reachable_relation3]
    ),  # reachable(x, z) :- edge(x, y), reachable(y, z).
]

# the edge facts
edge_fact1 = SymbolicSign(
    Fact("edge", [SymbolicConstant("alpha", type=SYM), String("b")])
)
edge_fact2 = SymbolicSign(Fact("edge", [String("b"), String("c")]))  # edge('b', 'c').
edge_fact3 = Fact("edge", [String("c"), String("b")])  # edge('c', 'b').
edge_fact4 = Fact("edge", [String("c"), String("d")])  # edge('c', 'd').

facts = [edge_fact1, edge_fact2, edge_fact3, edge_fact4]

interested_fact = Fact("reachable", [String("a"), String("b")])

# get constraints
constraints = symex(rules, facts, {interested_fact})

print(constraints)

# {reachable("a", "b").: And(alpha == "a", edge("a", "b").)}

Loading Datalog Rules and Facts from Files

Alternatively, Datalog rules and facts can be loaded from file sources:

from symlog.shortcuts import (
    SymbolicConstant,
    String,
    Fact,
    SymbolicSign,
    symex,
    load_facts,
    parse,
    substitute,
)
from symlog.souffle import SYM

# the reachability rules
rules = parse("example/reachability.dl")

# the edge facts
facts = load_facts("example/", rules.declarations)

alpha = SymbolicConstant("alpha")

facts = set(
    map(lambda fact: SymbolicSign(substitute(fact, {String("a"): alpha})), facts)
)

facts = {SymbolicSign(f) if str(f) == 'edge("b", "c").' else f for f in facts}

interested_fact = Fact("reachable", [String("a"), String("b")])

# get constraints
constraints = symex(rules, facts, {interested_fact})

print(constraints)

# {reachable("a", "b").: And(alpha == "a", edge("a", "b").)}

Patches

Generated patches: https://drive.google.com/file/d/1PY6AY_jrVVVQuCFg9bpwdPNM5AOlqMw1/view?usp=drive_link

Limitations

Symlog supports only basic Souffle syntax. Features such as aggregation, arithmetic operations, and components are not supported.
Currently, Symlog supports only positive Datalog. If you wish to use semi-positive Datalog, you must first convert it to positive Datalog, if possible. Support for stratified negation will be added in the future.

Acknowledgements

This work was partially supported by a Singapore Ministry of Education (MoE) Tier 3 grant “Automated Program Repair”, MOE-MOET32021-0001, and by a gift from Oracle.

symlog's People

Contributors

Stargazers

Watchers

symlog's Issues

Write simple tests using Symlog API

Symlog API

Design intuitive API for Symlog.

Negations

We discuss how to handle negations when applying Symlog to program repair in this issue.

Symlog has symbolic signs, which indicate whether the facts associated with themselves are positive or negative. Negative represents the fact is removed. Removing facts from the EDB database will cause fewer tuples generated if the given Datalog program is positive. But if the Datalog program contains negation, removing facts may cause more tuples derived. So, unlike positive Datalog where derived tuples are within some scope, tuples produced by stratified Datalog programs are difficult to estimate if we remove facts arbitrarily.

Possessing derived tuples is crucial to program repair since some derived tuples represent some detected bugs, and the goal of repair is to remove or add some facts such that those tuples cannot be derived. In principle, Symlog is capable of computing all possible produced tuples and their associated constraints. However, similar to conventional symbolic execution, the search space grows exponentially with the size of the EDB database, since the generated tuples may vary with each subset of the EDB database.

To mitigate the search explosion, we compute an over-approximation of derived tuples. We do so by discarding all negative literals in the given stratified Datalog program P, resulting in a new program P'. Since P' does not have any negative literal, the set of tuples generated from it is a superset of that from P given the same facts. Besides, P' is positive and monotonic, thus the set of derived tuples with all facts is a superset of that with fewer facts. So, the set of tuples produced from P' with all facts is an over-approximation of all possible tuples generated from P with an arbitrary subset of facts. We denote the set of tuples that represent some detected bugs as S. S is produced by running P' under all facts. If executing P does not generate any tuple in S, the user program is said to be repaired. To eliminate tuples in S, we can assert negation of their associated constraints (C) and solve these assertions.

But here is a problem, the constraints are collected by running P'. To get the constraints of running P, we need to add more constraints on C. The added constraints (C') are for the existence of some tuples which are negated and discarded. The conjunction of C and (not C') is the set of full constraints for S. The solution of (not (C and (not C'))) corresponds to repair patches of the analyzed user program. The set of C' is obtained in the same way, which may also consist of two parts like C and C', and so on and so forth.

Some questions

What is the method for selecting the EDBs that will be symbolized to generate target tuples (see this issue)?
After identifying 1-minial set of facts that are required for target tuples, which of those facts should be removed to prevent the generation of the target tuples (see this issue)?
When creating positive queries using symbolic facts, there may be multiple options for the symbolic facts. However, many of these options will not be valid. For instance, an assignment like "If_Var(goto/6, _, r#123)" is incorrect because the first argument should be an 'if' expression. In addition to these obvious incorrect assignments, there may also be other, less obvious invalid assignments that appear in inappropriate positions. E.g., use if(var != null) to surround if(var != null)...else... block. How can we filter out these incorrect tuples? For the remaining valid tuples, how can we decide which one to return?

Currently, we use semantics checkers to filter out invalid tuples. The semantics checkers are python functions that take the tuples as inputs. We do not modify the Datalog program. For the valid tuples, we simply return the first one we find. Is this approach ok?
In order to generate patches in various forms, it is necessary to encode patch components into the Datalog program. However, this approach will make the Datalog program more complex. Is it possible to use a simple Datalog program to generate rough patches first, and then use a post-processing python function to convert the raw patches to different forms? Or, should we just include one patch template in the Datalog program, which means our repair method can only generate one kind of patch?
Is it possible to use traditional delta-debugging methods, such as directly removing facts, to identify the input tuples that lead to the target tuples? Our current method involves adding 0 or 1 to Datalog rules in order to record the facts that generate the head tuple. However, this causes the program to change during delta-debugging, making it necessary to recompile the program periodically. This can be time-consuming. Additionally, we cannot use interpreter mode as Souffle does not support rules with more than 20 arguments in this mode.
Do we need to run delta-debugging multiple times to find all 1-minimal sets for target tuples?

EDB/fact selection

The following questions pertain to choosing the appropriate EDBs and facts to repair programs.

What is the method for selecting the EDBs that will be symbolized to generate target tuples?

To repair NPE bugs, a method that selects the 'if' related facts is enough. But the method should be general to handle other types of bugs. Alternatively, should this method not be one-size-fits-all and require customization for specific types of bugs?
After identifying 1-minial set of facts that are required for target tuples, which of those facts should be removed to prevent the generation of the target tuples?

In order to fix NPE bugs, we should remove the negated facts (i.e., add the corresponding positive facts). Similar as 1., do we need to design a general method that can handle other kinds of bugs?

Tuple Explosion

Problem Description

Sometimes the number of generated tuples is too large, making it impossible to finish the Datalog program without running out of memory. Even if the program can finish, the generated analysis result file may be too large in size, e.g. 1.1GB (may not be an issue?)

Reason

The number of generated tuples is influenced by the sizes of their corresponding domains. For example, in the following code, tuples like r(1, 2, _, _, _) are generated by tuples like e(1, 2, _, _) and n(1, _). Tuples like e(1, 2, _, _) are generated by domain_alpha(_) and domain_beta(_) tuples; n(1, _) tuples are produced by domain_gamma(_) tuples. So, the number of the r(1, 2, _, _, _) tuples is equal to the product of the sizes of the domains alpha, beta, and gamma: |domain_alpha(_)| * |domain_beta(_)| * |domain_gamma(_)|. If the sizes of these domains are not small, the generated tuples can be excessively numerous.

r(x, x, t1, t2, t3) :-
     n(x, t3),
     domain_alpha(t1),
     domain_beta(t2),
     domain_gamma(t3).
r(x, y, t1, t2, t3) :-
     r(x, z, t1, t2, t3),
     e(z, y, t1, t2),
     domain_alpha(t1),
     domain_beta(t2),
     domain_gamma(t3).
e(1, 2, t1, t2) :-
     domain_alpha(t1),
     domain_beta(t2).
e(t1, t2, t1, t2) :-
     domain_alpha(t1),
     domain_beta(t2).
n(1, t3) :-
     domain_gamma(t3).
n(t3, t3) :-domain_gamma(t3).