galoisinc / cclyzerpp

cclyzer++ is a precise and scalable pointer analysis for LLVM code.

Home Page: https://galoisinc.github.io/cclyzerpp/

License: BSD 3-Clause "New" or "Revised" License

program-analysis static-analysis pointer-analysis souffle datalog llvm

cclyzerpp's Introduction

cclyzer++

cclyzer++ is a precise and scalable global pointer analysis for LLVM code. The output of cclyzer++ can be used for a variety of program analysis tasks, including:

  • Creation of callgraphs with precise handling of indirect function calls and virtual method calls
  • Precise inter-procedural control- and data-flow analysis
  • Answering may-alias and must-not-alias queries

cclyzer++ is field- and array-sensitive, performs on-the-fly callgraph construction, and supports many different configurations of context-sensitivity including k-callsite sensitivity. It has subset-based (Andersen style) and unification-based (Steensgaard style) analyses. cclyzer++ is written in Soufflé Datalog, and so is highly parallel. cclyzer++ was derived from cclyzer.

See the documentation for more information about cclyzer++, including examples of its output. Documentation is also available online.

If you use cclyzer++ in your own work, please include the following citations:

  • Langston Barrett and Scott Moore. "cclyzer++: Scalable and Precise Pointer Analysis for LLVM." https://galois.com/blog/2022/08/cclyzer-scalable-and-precise-pointer-analysis-for-llvm/. 2022.
  • George Balatsouras and Yannis Smaragdakis. "Structure-sensitive points-to analysis for C and C++." In Static Analysis: 23rd International Symposium, SAS 2016, Edinburgh, UK, September 8-10, 2016, Proceedings 23, pp. 84-104. Springer Berlin Heidelberg, 2016.

Acknowledgments

This material is based upon work supported by the United States Air Force and the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-19-C-0004. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force or DARPA. Approved for Public Release, Distribution Unlimited.

cclyzerpp's People

Contributors

adrianherrera, langston-barrett, mingodad, peperunas, thinkmoore


cclyzerpp's Issues

dl: Suppress remaining warnings

TIL about a Souffle pragma to suppress warnings on a per-relation basis:

.pragma "suppress-warnings" "relation_name"

We should use this pragma to suppress warnings that can't/shouldn't be fixed (i.e., all the warnings that fire on v0.3). That way, warnings that are printed will be relevant.

Extraneous relations from LogicBlox port

There are a few relations that exist simply to rearrange parameters of other relations, e.g.,

.decl _typeinfo_by_class_name(?className: symbol, ?typeInfo: GlobalVariable)
_typeinfo_by_class_name(?className, ?typeInfo) :-
  _typeinfo_class_name(?typeInfo, ?className).

These are leftover from when cclyzer was written in LogicBlox; they have no benefit (and indeed, have a memory and time cost) now. They should be removed.

ci: Pin Ubuntu 22.04

To prevent CI from suddenly failing when ubuntu-latest changes, we should pin the runner to Ubuntu 22.04.
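Assuming the CI runs on GitHub Actions (an assumption; the workflow layout may differ), the pin is a one-line change to the runner image:

```yaml
jobs:
  build:
    # Pin the runner image so upstream changes to ubuntu-latest
    # cannot break CI unexpectedly.
    runs-on: ubuntu-22.04
```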

Remove uses of records in datalog/schema

There are a few record types in datalog/schema:

phi-instr.dl
9:.type PhiInstructionPair = [

switch-instr.dl
10:.type SwitchInstructionCase = [

landingpad-instr.dl
129:.type Clause = [

These don't seem to provide value above and beyond a version that doesn't use records. Indeed, most (all?) come with auxiliary relations that simply convert them to non-record forms.

These seem tied up with the only relations coming out of the FactGenerator with names that start with underscores (see #42).

Naming conventions

There's really one kind of "block" in the LLVM world. LLVM simply calls them llvm::Block, and we should do the same for the sake of brevity.

Remove unfinished features like debug info handling

When beginning work on #12, I realized that I would have to update parts of the fact generator that deal with LLVM debug information. This seems tedious and unnecessary, given that I don't think the analysis really depends in an essential way on the content of the debug information. Indeed, notes in the code indicate that the current support for debug info is still incomplete:

//UNFINISHED (needs debuginfo)

Similarly, there's a bunch of code related to C++ support that appears unfinished, either because it was never completed as part of cclyzer, or because it wasn't completely ported over from LogicBlox to Souffle:

// secondary_superclass(Supertype, Type) <-

I propose to remove unfinished code from cclyzer++. While it is a shame to remove what might be valuable work, it's hard to imagine how this work could provide value without considerable amounts of reverse-engineering. There are no tests and very limited commentary describing what the code is supposed to do, which makes it hard to have any confidence that it's doing it correctly. In the meantime, like all code, it makes the project overall more costly to maintain, slows down builds, and makes it harder to understand the complete project.

cmake: Integrate packaging

pkg/pkg.sh packages cclyzer++ with fpm. It assumes CMake has already been run. It would be better to integrate this as a custom CMake target that declares its dependencies on other targets.

Program fails `assert_every_pointer_constant_points_to_something`

 </mate/copy.ll>:45:getelementptr inbounds ([0 x i8], [0 x i8]* @uc, i32 0, i32 0)
; ModuleID = '/tmp/reduce-py-cache/incompatible-sign.-O0.bc'
source_filename = "/mate/llvm-project/clang/test/Sema/incompatible-sign.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

@uc = external global [0 x i8], align 1
@constinit = private global [1 x i8*] [i8* getelementptr inbounds ([0 x i8], [0 x i8]* @uc, i32 0, i32 0)], align 8

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 10.0.0-4ubuntu1 "}

A wrapper executable

Running the analysis is a two-step process: you have to run the fact generator, output facts to a directory, then separately compile or interpret the analysis and pass the fact directory to it. We should provide a thin wrapper executable (one per analysis) that runs the fact generator and then the analysis. This would simplify the UX, and potentially enable avoiding the overhead of writing to disk, which might be substantial for smaller programs.
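A minimal sketch of such a wrapper, in Python for brevity. The binary name (`factgen`) and its flag are hypothetical stand-ins; Soufflé's `-F`/`-D` options select the fact and output directories.

```python
import subprocess
from pathlib import Path

# Hypothetical sketch: "factgen" and "--out-dir" are illustrative names,
# not the real cclyzer++ entry points.
def build_commands(bitcode: str, facts_dir: str, analysis: str):
    factgen_cmd = ["factgen", "--out-dir", facts_dir, bitcode]
    souffle_cmd = ["souffle", f"-F{facts_dir}", f"-D{facts_dir}", analysis]
    return factgen_cmd, souffle_cmd

def run_analysis(bitcode: str, facts_dir: str, analysis: str) -> None:
    # Run fact generation, then the Datalog analysis, in one step.
    Path(facts_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(bitcode, facts_dir, analysis):
        subprocess.run(cmd, check=True)
```

A real wrapper could go further and pass facts to the analysis in memory, avoiding the disk round-trip entirely.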

ci: Build and release artifacts for multiple LLVM versions

We now build against multiple LLVM versions. The CI system should release artifacts for each of them, rather than just the latest. The Debian packages and Docker images will each need new naming schemes to reflect which version of LLVM they're for.

Unify FactGenerator/Datalog relation names

As of #40, the Fact Generator and Datalog code share a list of file/relation names. The Fact Generator refers to relations by group::rel, e.g., variable::name, whereas that corresponds to the Datalog relation variable_name. While there is a clear correspondence between variable::name and variable_name, the relationship doesn't hold in general, e.g. we also have variable::id corresponding to just variable. Thus, predicates.inc has lines like:

PREDICATE(global_var, unmangl_name, global_variable_has_unmangled_name)

where the first two entries describe the C++ (Fact Generator) name, and the third entry describes the filename/Datalog relation name. We should try to derive the latter from the former for the sake of consistency. This will involve changing a ton of Datalog code to use new relation names.
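A sketch of what a uniform derivation might look like. The treatment of `::id` is a hypothetical convention, and the derived names would differ from today's hand-written ones (e.g., `global_var_unmangl_name` rather than `global_variable_has_unmangled_name`), which is exactly the Datalog renaming this issue proposes:

```python
# Derive the Datalog/file name from the Fact Generator's
# "group::relation" name. The "id" special case is a hypothetical
# convention for pairs like variable::id -> variable.
def datalog_name(cpp_name: str) -> str:
    group, _, rel = cpp_name.partition("::")
    if rel in ("", "id"):
        return group
    return f"{group}_{rel}"
```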

bug: Callgraph changed in test program

Recent changes introduced a bug. In particular, the callgraph for this program is different between the initial commit (e224763) and a later one (d96ff8d) even though no functional changes have been introduced during this time.

C program
/*
MIT License

Copyright (c) 2019 yuawn

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

void init() {
  setvbuf(stdout, 0, 2, 0);
  setvbuf(stdin, 0, 2, 0);
  setvbuf(stderr, 0, 2, 0);
}

int read_int() {
  char buf[0x10];
  __read_chk(0, buf, 0xf, 0x10);
  return atoi(buf);
}

void welcome_func() { puts("Hello ~~~"); }

void bye_func() { puts("Bye ~~~"); }

void menu() {
  puts("1. add a box");
  puts("2. exit");
  puts(">");
}

struct MessageBox {
  void (*welcome)();
  void (*bye)();
};

void backdoor() { system("sh"); }

int main() {

  init();

  struct MessageBox *msgbox =
      (struct MessageBox *)malloc(sizeof(struct MessageBox));

  msgbox->welcome = welcome_func;
  msgbox->bye = bye_func;

  msgbox->welcome();
  free(msgbox);

  int n = 3, size;
  char *msg;

  while (n--) {
    printf("Size of your message: ");
    size = read_int();

    msg = (char *)malloc(size);

    printf("Message: ");
    read(0, msg, size);

    printf("Saved message: %s\n", msg);

    free(msg);
  }

  msgbox->bye();

  return 0;
}

ci: Lint Souffle warnings

We should ensure new code doesn't add new warnings from Souffle. Souffle doesn't have modular control over which warnings are enabled (and for which parts of the code), and some warnings are "false positives" in the sense that we don't want to or can't fix them. We'll likely need some kind of "golden file" approach.
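A minimal sketch of the golden-file comparison, assuming warnings are compared line-by-line against a checked-in baseline:

```python
# "Golden file" approach: compare Souffle's warning output against a
# checked-in baseline and report only the warnings that are new.
def new_warnings(current_output: str, golden_output: str) -> list[str]:
    known = set(golden_output.splitlines())
    return [
        line
        for line in current_output.splitlines()
        if line.strip() and line not in known
    ]
```

CI would fail whenever `new_warnings` is non-empty; the golden file would only be updated deliberately, when a known-unfixable warning is added.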

Move invariant tests from Python to Datalog

There are a few tests written in Python (in test_pointer_analysis_invariants) that would be better expressed as assertions in Datalog. They would be easier to write and faster to evaluate.
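For context, such an invariant check in Python looks something like the following sketch (relation shapes are illustrative, not the real cclyzer++ schema); the same property would be a single assertion rule in Datalog:

```python
import csv

def read_relation(path: str) -> list[list[str]]:
    # Souffle writes output relations as tab-separated values.
    with open(path, newline="") as f:
        return list(csv.reader(f, delimiter="\t"))

# Example invariant: every pointee mentioned in the points-to relation
# should appear in the allocation relation. In Python this needs explicit
# set-building and loops; in Datalog it is one rule.
def check_pointees_are_allocations(alloc_rows, points_to_rows) -> None:
    allocations = {row[0] for row in alloc_rows}
    for _ptr, pointee in points_to_rows:
        assert pointee in allocations, f"unknown allocation: {pointee}"
```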

Provide a Dockerfile

The repo should have a Dockerfile with two targets: dev, which contains all development dependencies, and dist, which has an installed version of cclyzer++.

Ideally, both images would be built and pushed in CI. The dev image would be documented in the developer documentation, the dist image in the getting started documentation (perhaps even a "Quick Start" section which involves pulling the container).

Freeze Python development tools

We use Sphinx for documentation, pytest for tests, and mypy for checking the tests. We should provide a requirements.txt file which indicates known-good versions of all these tools, and install them in the dev image.

Single source of truth for input relations

There are currently three-ish sources of truth for what constitutes an input relation:

  • The fact generator has a notion of "predicate groups"
  • The analysis has import-entities.dl and import-predicates.dl
  • The analysis also has separate declarations for the relations that are imported

This is bad because it requires changes in multiple places to add and remove such predicates, and it's easy to generate more facts in the fact generator than are consumed by the analysis (see e.g. how the CI passed on the first commit of #27, even though it removed the analysis's handling/consumption of some facts).

Both the fact generator and the analysis can use the C pre-processor, so we could easily make an include file that's used by both (LLVM has some examples of this, e.g. the ARM sub-architectures).

Upgrade to the latest LLVM

cclyzer++ has only been tested with LLVM 10. Upgrading would involve:

  • Reading through the LLVM changelogs for any relevant changes in semantics
  • Upgrading the FactGenerator and C++ interfaces to the new LLVM API

Single source of truth for output relations

Dually to #37, the C++/LLVM pass and Datalog code separately declare the names and types of the output relations. I'm not as certain that we could easily unify these, but we should look into it.

ci: Run tests

CI builds cclyzer++ but doesn't yet run its tests.

ci: Build

The CI system currently builds and publishes the documentation. It should also compile the project.

Make `Instruction` into a union type of opcodes

Similarly to how Type is a union of PrimitiveType, IntegerType, etc., we should make Instruction into a union type of all the known opcodes. This gives stronger type-safety guarantees when a relation has an opcode-specific type.

Provide executables

The CI system should build and upload static executables as artifacts for ease of installation.

ci: Lint

The CI system currently builds and publishes the documentation. It should also run clang-tidy and clang-format and report any errors or discrepancies.

docker: Install cclyzer++ from Debian package

Requires #71 and #75. The Docker dist image should install cclyzer++ from the Debian package, rather than manually copying in files from build/. This will help us ensure that the package is well-formed, and reduce duplication.

Missing facts due to not modeling memset

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

typedef struct s {
  char **z;
  int y;
} s;

char glob = 'c';

char **__attribute__((noinline)) get(s *x) {
  printf("Don't inline me, please! %p\n", x);
  return &x[1].z[0];
}

int main() {
  s x[2] = {0};
  char **u = get(x); // points to x[1].z[0]
  return *x[1].z[0];
}
%struct.s = type { i8**, i32 }

@glob = dso_local local_unnamed_addr global i8 99, align 1
@.str = private unnamed_addr constant [29 x i8] c"Don't inline me, please! %p\0A\00", align 1

; Function Attrs: noinline nounwind uwtable
define dso_local i8** @get(%struct.s*) local_unnamed_addr #0 {
  %2 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([29 x i8], [29 x i8]* @.str, i64 0, i64 0), %struct.s* %0)
  %3 = getelementptr inbounds %struct.s, %struct.s* %0, i64 1, i32 0
  %4 = load i8**, i8*** %3, align 8, !tbaa !2
  ret i8** %4
}

; Function Attrs: nounwind
declare dso_local i32 @printf(i8* nocapture readonly, ...) local_unnamed_addr #1

; Function Attrs: nounwind uwtable
define dso_local i32 @main() local_unnamed_addr #2 {
  %1 = alloca [2 x %struct.s], align 16
  %2 = bitcast [2 x %struct.s]* %1 to i8*
  call void @llvm.lifetime.start.p0i8(i64 32, i8* nonnull %2) #4
  call void @llvm.memset.p0i8.i64(i8* nonnull align 16 %2, i8 0, i64 32, i1 false)
  %3 = getelementptr inbounds [2 x %struct.s], [2 x %struct.s]* %1, i64 0, i64 0
  %4 = call i8** @get(%struct.s* nonnull %3)
  %5 = getelementptr inbounds [2 x %struct.s], [2 x %struct.s]* %1, i64 0, i64 1, i32 0
  %6 = load i8**, i8*** %5, align 16, !tbaa !2
  %7 = load i8*, i8** %6, align 8, !tbaa !8
  %8 = load i8, i8* %7, align 1, !tbaa !9
  %9 = sext i8 %8 to i32
  call void @llvm.lifetime.end.p0i8(i64 32, i8* nonnull %2) #4
  ret i32 %9
}

Both @get:%4 and @main:%4 should point to x[1].z[0] (i.e., *null*), but they don't point to anything:

[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1]	[<<main-context>>, nil]	</mate/out.ll>:get:%0
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][0]	[<<main-context>>, nil]	</mate/out.ll>:get:%0
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][0][0]	[<<main-context>>, nil]	</mate/out.ll>:get:%0
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1]	[<<main-context>>, nil]	</mate/out.ll>:main:%1
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][1]	[<<main-context>>, nil]	</mate/out.ll>:get:%3
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][1].?/0	[<<main-context>>, nil]	</mate/out.ll>:get:%3
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][0][1]	[<<main-context>>, nil]	</mate/out.ll>:get:%3
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][0][1].?/0	[<<main-context>>, nil]	</mate/out.ll>:get:%3
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][1]	[<<main-context>>, nil]	</mate/out.ll>:main:%5
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][1].?/0	[<<main-context>>, nil]	</mate/out.ll>:main:%5
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][0][1]	[<<main-context>>, nil]	</mate/out.ll>:main:%5
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][0][1].?/0	[<<main-context>>, nil]	</mate/out.ll>:main:%5
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1]	[<<main-context>>, nil]	</mate/out.ll>:main:%3
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][0]	[<<main-context>>, nil]	</mate/out.ll>:main:%3
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1][0][0]	[<<main-context>>, nil]	</mate/out.ll>:main:%3
[<<main-context>>, nil]	*stack_alloc@main[[2 x %struct.s]* %1]	[<<main-context>>, nil]	</mate/out.ll>:main:%2

Correctness tests

We currently have three kinds of tests:

  1. Python correctness tests like callgraph_test that read Datalog relations into Python and make assertions about them
  2. Property tests like relation_property_test that assert general features of the output of the analysis
  3. Assertions in the Datalog code which are checked during property tests

Only the first kind checks (and documents) the intended behavior of specific groups of rules in the analysis, and we have very few such tests. For rules that contribute only indirectly or infrequently to the analysis output, it's hard to say what they're intended to do or whether they're doing it correctly (e.g., the rules dealing with pass-by-value semantics).

We should expand these kinds of tests. However, they're a pain to write in Python (the style is very "imperative" rather than "declarative"). I'm not sure what the best solution is, but perhaps lit and FileCheck could be useful.

doc: Change Sphinx theme

The current (default) Sphinx theme has some issues:

  • The sidebar doesn't adequately separate different ToCs
  • It displays poorly on mobile

We should use a different theme, perhaps the one that MATE does.

LLVM type relations are both inputs and computed

The relations primitive_type, integer_type, fp_type, etc. are generated by the FactGenerator, but they also have rules in the Datalog code. This situation should at the very least be documented, but possibly changed so that only one of the two is responsible for adding facts to these relations.

Advantages to doing it in Datalog:

  • We know statically that there will be at least a few primitive types: i1, float, etc. Perhaps Soufflé and Clang can take advantage of this when these are stated as facts and generate better code.

Advantages of doing it in the FactGenerator:

  • We account for only and exactly the types that actually appear in the module at hand
  • Less Datalog code leads to smaller synthesized C++ code, which means lower compile times

Also, I think this might be tied up with a difference in the output of cclyzer++ in the ntu-uaf.c test case, but I'm not sure exactly how.

Semantic versioning

We should decide and document what the public API of cclyzer++ is. It probably contains at least the filenames that are output by the non-debug .project files, and might include the C++ API for consuming those files.

After documenting this, we should commit to semantic versioning with respect to it, since this is a widely-used versioning scheme that will help consumers understand and effectively use different versions of cclyzer++.

Determine how long Docker images are kept

The CI system now builds and pushes two Docker images for each commit on a PR or to main. This is convenient, but that convenience should be balanced against the cost of storage. We should determine how long these images are kept, and possibly revise when they are pushed to GHCR.

Arrays may be indexed as pointers

Consider this C program:

#include <stdio.h>

char get4(char *buf) { return buf[4]; }

void indirect() {
  char tp_indir_buf[8];
  printf("c = '%c'\n", get4(tp_indir_buf));
}

which produces this LLVM program:

; Function Attrs: noinline nounwind optnone uwtable
define dso_local signext i8 @get4(i8*) #0 {
  %2 = alloca i8*, align 8
  store i8* %0, i8** %2, align 8
  %3 = load i8*, i8** %2, align 8
  %4 = getelementptr inbounds i8, i8* %3, i64 4
  %5 = load i8, i8* %4, align 1
  ret i8 %5
}

; Function Attrs: noinline nounwind optnone uwtable
define dso_local void @indirect() #0 {
  %1 = alloca [8 x i8], align 1
  %2 = getelementptr inbounds [8 x i8], [8 x i8]* %1, i32 0, i32 0
  %3 = call signext i8 @get4(i8* %2)
  %4 = sext i8 %3 to i32
  %5 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str, i32 0, i32 0), i32 %4)
  ret void
}

The array-typed allocation indirect:%1 will be indexed by the getelementptr in get4 as if it were a pointer. However, our rules for deciding which array suballocations to create don't account for this kind of indexing:

pointer_index(?region, ?type, ?finalIdx) :-
  ( ( getelementptr_instruction_index(?insn, 0, ?indexOp)
    , constant_to_int(?indexOp, ?index)
    , getelementptr_instruction_base_type(?insn, ?declaredType)
    , _alloc_region(?region)
    )
  ; ( getelementptr_constant_expression_index(?expr, 0, ?indexOp)
    , constant_to_int(?indexOp, ?index)
    , _getelementptr_constant_expression_base_type(?expr, ?declaredType)
    , global_region(?region)
    )
  ),
  ?index >= 0,
  // The first index of a GEP is always a pointer type.
  pointer_type_has_component(?declaredType, ?declaredElemType),
  ( ( type_compatible(?type, ?declaredElemType)
    , type_has_size(?type, ?size)
    )
  ; // This case is not redundant with the previous one; there may be
    // types ?type where there is no pointer_type_has_component(_, ?type).
    // Conversely, there may be types that are not compatible, but such that
    // pointer types to them are compatible (i8 and i64 vs. i8* and i64*).
    ( pointer_type_has_component(?ptrType, ?type)
    , type_compatible(?ptrType, ?declaredType)
    , type_has_size(?type, ?size)
    )
  ),
  // Adjust the index... (this calculation should match what's done for GEP)
  ?size > 0,
  type_has_size(?declaredElemType, ?declaredSize),
  // Soufflé can still divide by zero here if it reorders clauses, so we take max
  ?finalIdx = (?index * ?declaredSize) / max(?size, 1),
  // Check that the type sizes divide evenly...
  (?finalIdx * ?size) = (?index * ?declaredSize).

.decl pointer_offset(?region: Region, ?type: Type, ?offset: SubregionOffset)

// Still need [0] for unsized types, e.g., opaque structs.
pointer_offset(?region, ?type, 0) :-
  pointer_index(?region, ?type, 0).

pointer_offset(?region, ?type, ?index * ?size) :-
  pointer_index(?region, ?type, ?index),
  type_has_size(?type, ?size).

// Find all statically possible array indices
.decl array_indices__no_typecomp(?region: Region, ?type: ArrayType, ?index: ArrayIndex)

array_indices__no_typecomp(?region, ?declaredType, as(?constantIndex, ArrayIndex)) :-
  ( ( getelementptr_instruction_index(?insn, ?index, ?indexOp)
    , getelementptr_instruction_interm_type(?insn, ?index, ?declaredType)
    , _alloc_region(?region)
    )
  ; ( getelementptr_constant_expression_index(?expr, ?index, ?indexOp)
    , _getelementptr_constant_expression_interm_type(?expr, ?index, ?declaredType)
    , global_region(?region)
    )
  ),
  constant_to_int(?indexOp, ?constantIndex),
  array_type(?declaredType).

// Same thing, but also consider compatible array types
.decl array_indices(?region: Region, ?type: ArrayType, ?index: ArrayIndex)

array_indices(?region, ?type, ?constantIndex) :-
  array_type(?type),
  type_contains_pointer(?type),
  type_compatible(?type, ?declaredType),
  array_indices__no_typecomp(?region, ?declaredType, ?constantIndex).

In particular, pointer_index is separate from array_index, so array allocations don't have numbered suballocations at indices that are used to index into pointers. This is an issue for precision, because the GEP rules will produce points-to facts for the imprecise [*]-suballocation whereas they could in principle use suballocations at particular indices. The fix would be to add indices to array_index that correspond to pointer-indexing GEPs at a compatible type:

diff --git a/datalog/points-to/allocations-subobjects.dl b/datalog/points-to/allocations-subobjects.dl
index 2f58634..49f3163 100644
--- a/datalog/points-to/allocations-subobjects.dl
+++ b/datalog/points-to/allocations-subobjects.dl
@@ -41,14 +41,18 @@ _alloc_region(?r) :- global_region(?r).
 _alloc_region(?r) :- heap_region(?r).
 _alloc_region(?r) :- stack_region(?r).
 
+.decl _array_or_pointer_index(?region: Region, ?type: Type, ?index: ArrayIndex)
 .decl pointer_index(?region: Region, ?type: Type, ?index: ArrayIndex)
 
 pointer_index(?region, ?type, 0) :-
   _alloc_region(?region),
   type(?type).
 
+pointer_index(?region, ?type, ?idx) :-
+  _array_or_pointer_index(?region, ?type, ?idx).
+
 // TODO(lb): Use `gep_zero_index_offset` here.
-pointer_index(?region, ?type, ?finalIdx) :-
+_array_or_pointer_index(?region, ?type, ?finalIdx) :-
   ( ( getelementptr_instruction_index(?insn, 0, ?indexOp)
     , constant_to_int(?indexOp, ?index)
     , getelementptr_instruction_base_type(?insn, ?declaredType)
@@ -120,6 +124,11 @@ array_indices(?region, ?type, ?constantIndex) :-
   type_compatible(?type, ?declaredType),
   array_indices__no_typecomp(?region, ?declaredType, ?constantIndex).
 
+// Arrays may be indexed as pointers
+array_indices(?region, ?type, ?index) :-
+  array_type_has_component(?type, ?elemType),
+  _array_or_pointer_index(?region, ?elemType, ?index).
+
 .decl array_offset(?region: Region, ?type: Type, ?offset: SubregionOffset)
 array_offset(?region, ?type, ?index * ?size) :-
   array_indices(?region, ?type, ?index),
@@ -212,7 +221,7 @@ not_array_index(?component) :- path_component_at_any_index(?component).
 .type ArraySubregion
   = ArrayIndexSubregion
   | AnyArrayIndexSubregion
-  
+
 .type ArrayIndexSubregion <: symbol
 .type AnyArrayIndexSubregion <: symbol
