Reverse Optimization Binary Transformation Anti-Plagiarism System
This is a repo with all the work related to my thesis defense, into the field of Computer Science.
The significant increase in the use of remote programming education platforms since the COVID-19 crisis, coupled with recent advances in the field of generative Artificial Intelligence (AI), has brought new challenges to the way source codes are typically evaluated for plagiarism. The effort required to commit plagiarism has decreased, whether by using automated syntactic transformation techniques or by generating alternative versions of the source code through generative AI. Most of the current approaches used in the construction of anti-plagiarism tools are not robust enough to deal with the new scope of such technological advances. Existing works in the literature on code cloning detectors have demonstrated that there are syntactic transformations that change the code in such a way that the syntactic/semantic representation algorithms used in plagiarism assessment (e.g., AST—abstract syntax tree; CFG—control flow graph; PDG—program dependency graph; and Tokenization) do not recognize the altered code as a possible clone. Currently, the approach of binary code optimization is already recognized as an obfuscation technique and manipulation of malicious codes in the area of antivirus evasion, but little attention has been given to its potential in the construction of anti-plagiarism tools. In this work, we propose a method for plagiarism detection that extends the capabilities demonstrated by existing approaches. Our method consists of a technique that not only analyzes source codes, or just the corresponding binary files resulting from compilation but also performs an analysis of the binary resulting from the compilation with optimizations. Intuitively, the use of compilation optimizations serves as a "reverse filter," removing from the binary code possible sources of obfuscation implemented by the plagiarist in their modification of an original source code. Our experiments confirm this correlation between optimized binary code and its original source code. The empirical evaluation of our method demonstrates a use for calculating levels of similarity between binary codes, having good efficiency in detecting plagiarism even when the corresponding source codes have undergone complex syntactic changes. Thus, we believe that the method proposed here can be effectively used as a complementary tool to existing ones for conducting plagiarism analysis.